Class 2: Regression Models

by Cory Lanker, LLNL statistician

June 27, 2017, 10:00-11:30 in B543 R1001

This course is a four-class introduction to predictive modeling. It is for all skill levels and will use R statistics software. But as the course is presented with R notebooks, all the code is here. In this class, I'll discuss the different prediction models and how to code them in R. Tomorrow (at 2:00) we'll discuss the categorical response parallel to today's material: classification. The last class (Thurs) is a prediction contest where teams will use this material in a competition to get the best prediction values.

See the Class 1 notes for assistance installing Jupyter and Anaconda.

In this class:

  • Linear Regression: regression assessment and linear regression models
  • Nonlinear Regression: a survey of nonlinear regression models in R
  • Regression Trees: tree regression models including random forests and boosted trees
  • Example Data Set: comparing the various regression models when predicting concrete performance

This work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344. LLNL-PRES-733567

References for the course

This course is structured around Max Kuhn and Kjell Johnson's master's-level text on prediction modeling. The code is either from the book, especially when using their wonderful caret R package, or it's code I wrote. Their code is disseminated under their copyright. All data sets in this course are all publicly available.

References

  1. Breiman, Leo. "Statistical modeling: The two cultures." Statistical science 16.3 (2001): 199-231.
  2. Clarke, Bertrand, Ernest Fokoue, and Hao Helen Zhang. Principles and theory for data mining and machine learning. New York: Springer, 2009.
  3. Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. The elements of statistical learning. New York: Springer, 2001.
  4. James, Gareth, et al. An introduction to statistical learning. New York: Springer, 2013.
  5. Kuhn, Max, and Kjell Johnson. Applied predictive modeling. New York: Springer, 2013.
  6. Venables, William N., and Brian D. Ripley. Modern applied statistics with S. New York: Springer, 2002.

Source functions and install libraries

In [223]:
# functions for this page
plotsize <- function(wd,ht){options(repr.plot.width=wd, repr.plot.height=ht)}
packagelist <- c("MASS", "FNN", "ellipse", "AppliedPredictiveModeling", "Hmisc", 
                 "missForest", "caret", "corrplot", "earth", "lattice", "e1071",
                 "elasticnet", "pls", "doMC", "kernlab", "nnet", "rpart", "ipred",
                 "Cubist", "gbm", "party", "partykit", "randomForest", "car",
                 "plyr", "proxy", "RWeka")
# It's likely that RWeka won't load on LLNL machines... that package isn't necessary.
output <- sapply(packagelist, require, character.only = TRUE)
if (any(!output)){
    cat('Installing packages: ', packagelist[!output],' ...\n')
    install.packages(packagelist[!output])
    output <- sapply(packagelist[!output], require, character.only = TRUE)
}
# sessionInfo()
Loading required package: RWeka
Installing packages:  RWeka  ...
Installing package into ‘/Users/lanker1/Library/R/3.3/library’
(as ‘lib’ is unspecified)
  There is a binary version available but the source version is later:
      binary source needs_compilation
RWeka 0.4-26 0.4-34             FALSE

installing the source package ‘RWeka’

Warning message in download.file(url, destfile, method, mode = "wb", ...):
“URL 'https://cran.r-project.org/src/contrib/RWeka_0.4-34.tar.gz': status was 'Couldn't resolve host name'”Warning message in download.packages(pkgs, destdir = tmpd, available = available, :
“download of package ‘RWeka’ failed”Loading required package: RWeka

Part 5. Assessing regression models

A common metric to assess performance of a regression model (predicts continuous $y$) is to calculate the mean squared error (MSE) between the predictions and the true values: $$MSE = \frac1n\sum_{i=1}^n (\hat{y}_i - y_i)^2.$$

MSE penalizes errors according to the square of the deviation. This means if you are off by 0.1 for each of 100 observations, you have the same MSE if you get 99 exactly correct but off by 1 on the remaining observation: $$\frac1{100} \sum_{i=1}^{100} (0.1)^2 = \frac1{100}\, (100)(.01) = \frac1{100} \left( \sum_{i=1}^{99} (0)^2 + \sum_{i=100}^{100} (1)^2\right).$$ The average absolute error or AAE ($\sum |\hat{y}-y|$) is 0.1 and 0.01, respectively, but the MSE is the same.

The optimizer for MSE is to predict the expected mean $E(Y|X)$, while the optimizer for AAE is to predict the expected median of the distribution

  • if you're not at the median, then you can reduce AAE by moving towards the more populated side

MSE is often reported using its square root, RMSE.

Solubility data

This class will use a data set on solubility data. A drug's solubility plays a factor in whether medications can be administered orally. The data set consists of 1267 compounds and

In [224]:
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Copyright 2013 Kuhn and Johnson
### Web Page: http://www.appliedpredictivemodeling.com
### Contact: Max Kuhn (mxkuhn@gmail.com)
###
### Chapter 6: Linear Regression and Its Cousins
###
### Required packages: AppliedPredictiveModeling, lattice, corrplot, pls, 
###                    elasticnet,
###
### Data used: The solubility from the AppliedPredictiveModeling package

plotsize(6,3)
data(solubility)
cat(dim(solTrainX),"::",length(solTrainY),"::",dim(solTestX),"::",length(solTestY),'\n')
951 228 :: 951 :: 316 228 :: 316 
In [225]:
col.factor = sapply(solTrainX[1,],is.factor)
table(col.factor)
plotsize(6,3)
col.unique = sapply(apply(solTrainX,2,unique),length)
cat('There are', min(which(col.unique > 2))-1,'binary predictors (fingerprints). The rest are called descriptors.')
col.factor
FALSE 
  228 
There are 208 binary predictors (fingerprints). The rest are called descriptors.

The other 20 predictors are descriptors: 16 count variables (e.g., number of bond types or certain atoms) and 4 continuous predictors.

In [226]:
tabulate(colSums(is.na(rbind(solTrainX, solTestX))))
sum(is.na(solTrainY))
sum(col.unique[1:208] != 2)
0
0
0
In [227]:
plotsize(7,3)
plot(temp <- colMeans(solTrainX[,1:208]),ylab='proportion',ylim=c(0,1),pch=3)
min(temp)
abline(h=0,col='blue')
plotsize(7,9)
par(mfrow=c(4,1), las=1, mai=c(.25,1.25,.25,.125))
boxplot(solTrainX[,209], horizontal=T, main=colnames(solTrainX)[209])
boxplot(solTrainX[,210:224], horizontal=T)
boxplot(solTrainX[,225:226], horizontal=T)
boxplot(solTrainX[,227:228], horizontal=T)
0.0378548895899054
In [228]:
head(solTrainX[,209:228])
MolWeightNumAtomsNumNonHAtomsNumBondsNumNonHBondsNumMultBondsNumRotBondsNumDblBondsNumAromaticBondsNumHydrogenNumCarbonNumNitrogenNumOxygenNumSulferNumChlorineNumHalogenNumRingsHydrophilicFactorSurfaceArea1SurfaceArea2
661208.2828 16 30 18 16 0 0 16 12 14 2 0 0 0 0 3 -0.85625.78 25.78
662365.5449 26 52 29 13 4 0 12 23 21 3 1 1 0 0 4 -0.37052.19 80.43
663206.3133 15 33 15 7 4 1 6 18 13 0 2 0 0 0 1 -0.33037.30 37.30
665136.2626 10 26 10 2 1 2 0 16 10 0 0 0 0 0 1 -0.960 0.00 0.00
668229.7531 15 31 15 6 5 0 6 16 9 5 0 0 1 1 1 -0.06953.94 53.94
669270.2532 15 31 14 2 5 2 0 17 10 1 1 1 2 2 0 -0.65120.31 45.61

Columns 1-208 are binary 0/1, Columns 209 (MolWt), 226:228 (Hydro, SA1, SA2) are continuous, Columns 210:225 are 16 count descriptors.

Constant prediction (MSE and AAE minimizers)

Let $y$ be $\log(\text{solubility})$. The range, mean, and median are, with a boxplot:

In [229]:
y = solTrainY
cat("range=",range(y)," mean=",mean(y)," median=",median(y))
range= -11.62 1.58  mean= -2.71857  median= -2.51
In [230]:
plotsize(7,2.5)
boxplot(solTrainY, horizontal=TRUE, main="Boxplot of log(Solubility) values")
In [231]:
# show plot of MSE versus constant value, and minimum at mean(Y)
xs = seq(-3.5,-1.75,,176)
mse = sapply(xs, function(x) (1/length(y))*sum((y - x)^2))
# show plot of AAE versus constant value, and minimum at median(Y)
aae = sapply(xs, function(x) (1/length(y))*sum(abs(y-x)))
plotsize(8,3.5)
par(mfrow=c(1,2))
plot(xs,mse,type='l',lwd=2,main='MSE vs. constant',xlab='constant')
abline(v=mean(y),col='red',lwd=2)
abline(v=median(y),col='blue',lwd=2)
legend('topleft',c('mean','median'),lwd=2,col=c(2,4),cex=.75,y.intersp=2.5,box.lwd=.25)
plot(xs,aae,type='l',lwd=2,main='AAE vs. constant',xlab='constant')
abline(v=mean(y),col='red',lwd=2)
abline(v=median(y),col='blue',lwd=2)
In [232]:
# show RMSE calculation using mean value
cat("The best training RMSE using a constant prediction value is",
   rmse <- sqrt( (1/length(y))*sum((y-mean(y))^2)))
The best training RMSE using a constant prediction value is 2.045565

We may get a sense about RMSE if we show the median $\pm$ RMSE on the boxplot of log(Solubility) values:

In [233]:
plotsize(7,3)
hist(y, 30, main="Mean log(Solubility) +/- RMSE")
abline(v=mean(y),lwd=4,col='darkorange')
abline(v=mean(y)+c(-1,1)*rmse,lwd=3,lty=2,col='darkorange')

Pausing to consider our goal

Our goal is to predict solubility (continuous) based on the information in the 208 binary/16 count/4 continuous predictors. Only a minuscule portion of the predictor space is populated with our $\mathbf{X}$ observations, and it's only at those locations that we know a $y$. How to find a function of $\mathbf{X}$ approximating $E(y|X)$ (thus minimizing the mean squared error) given such limited data?

Considering the space $\{0,1\}^{208}$

There are $2^{208}$ or $4.1 \times 10^{62}$ possible arrangements of these 208 binary variables.

Modeling will only work if there are strong influences on $y$ from the individual values and can consider the predictors separately. There are only $2^{208}$ possibilities of how $y$ is affected if each 208-level interaction is important.

For example, we could have a case of two binary predictors such if $x_1$ or $x_2$ is one, $y$ is likely to be high, if $x_1$ and $x_2$ are both zero, then $y$ takes a low value. In this case, there isn't an interaction effect between $x_1$ and $x_2$ so we can consider the variables separately when forming a prediction for $y$. However, if in addition, when $x_1$ and $x_2$ are both one then $y$ is low, now we must consider all four possibilities of the $x_1, x_2$ space.

In [234]:
solTemp <- solTrainX[,1:208]
solTemp$Y <- solTrainY
solCorr <- cor(solTemp[,c(209,1:208)])

a <- which.max(solCorr[1,2:209])
cat(a, solCorr[1,a+1])
plotsize(7,3.5)
plot(solCorr[1,2:209], ylab='Correlation with y')
abline(h=0,col=3)
plotsize(8,8)
corrplot::corrplot(solCorr, tl.cex=.3, order="hclust")  # try running with and without hclust
#corrplot::corrplot(solCorr, tl.cex=.3)
72 0.3366402
In [235]:
solTemp <- solTrainX[,209:228]
solTemp$Y <- solTrainY
solCorr <- cor(solTemp[,c(21,1:20)])

a <- which.max(solCorr[1,2:21])
cat(a, solCorr[1,a+1])
plotsize(7,3.5)
plot(solCorr[1,2:21], ylab='Correlation with y')
abline(h=0,col=3)
18 0.3090222

There looks to be a lot of predictive information in these variables.

solTrainX vs. solTestX

Our task is to predict solTestX data. While they will not span the exact same space at the training data, it should be close, at least in the ballpark. Let's check the univariate distributions as a basic sanity test.

In [236]:
plotsize(7,6)
plot(colMeans(solTrainX[,1:208]),ylab='proportion',ylim=c(0,1),pch=3)
points(temp <- colMeans(solTestX[,1:208]),ylab='proportion',col=2,ylim=c(0,1),pch=3)
min(temp)
abline(h=0,col='blue')
sum(colMeans(solTrainX[,1:208]) > colMeans(solTestX[,1:208]))/208
0.0316455696202532
0.668269230769231

Hmm... 67% of the binary predictors have more 1's in the training data. A red flag to potential overfitting.

The statistical test:

In [237]:
cat("p-value of similarity is",2*(1-pbinom(0.5 + 0.668*208, 208, 0.5)))
p-value of similarity is 6.666165e-07
In [238]:
rv
FP001FP002FP003FP004FP005FP006FP007FP008FP009FP010NumCarbonNumNitrogenNumOxygenNumSulferNumChlorineNumHalogenNumRingsHydrophilicFactorSurfaceArea1SurfaceArea2
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 -0.986 0.00 0.00
1 1 1 1 1 1 1 1 1 1 33 6 13 4 10 10 7 13.483331.94331.94
In [239]:
rv = apply(rbind(solTestX, solTrainX), 2, range)
plotsize(7,9)
par(mfrow=c(4,1), las=1, mai=c(.25,1.25,.25,.125))
boxplot(solTrainX[,209], horizontal=T, main="Training: MolWt", ylim=rv[,209])
boxplot(solTestX[,209], horizontal=T, main="Testing: MolWt", ylim=rv[,209])
is.range = 210:224; yrange = c(min(rv[1,is.range]), max(rv[2,is.range]))
boxplot(solTrainX[,is.range], horizontal=T, main="Training data", ylim=yrange)
boxplot(solTestX[,is.range], horizontal=T, main="Test data", ylim=yrange)
is.range = 225:226; yrange = c(min(rv[1,is.range]), max(rv[2,is.range]))
boxplot(solTrainX[,is.range], horizontal=T, main="Training data", ylim=yrange)
boxplot(solTestX[,is.range], horizontal=T, main="Test data", ylim=yrange)
is.range = 227:228; yrange = c(min(rv[1,is.range]), max(rv[2,is.range]))
boxplot(solTrainX[,is.range], horizontal=T, main="Training data", ylim=yrange)
boxplot(solTestX[,is.range], horizontal=T, main="Test data", ylim=yrange)

Data Exploration

In [240]:
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 6.1 Case Study: Quantitative Structure- Activity
### Relationship Modeling


### Some initial plots of the data
plotsize(6,4)
xyplot(solTrainY ~ solTrainX$MolWeight, type = c("p", "g"),
       ylab = "Solubility (log)",
       main = "(a)",
       xlab = "Molecular Weight")
xyplot(solTrainY ~ solTrainX$NumRotBonds, type = c("p", "g"),
       ylab = "Solubility (log)",
       xlab = "Number of Rotatable Bonds")
bwplot(solTrainY ~ ifelse(solTrainX[,100] == 1, 
                          "structure present", 
                          "structure absent"),
       ylab = "Solubility (log)",
       main = "(b)",
       horizontal = FALSE)
In [241]:
# (C) Kuhn and Johnson (2013)
### Find the columns that are not fingerprints (i.e. the continuous
### predictors). grep will return a list of integers corresponding to
### column names that contain the pattern "FP".

isFingerprints <- grep("FP", names(solTrainXtrans))
notFingerprints <- setdiff(1:ncol(solTrainXtrans),isFingerprints)
plotsize(8,5)
featurePlot(solTrainXtrans[, notFingerprints[c(1:7,9:11,13,17)]],
            solTrainY,
            between = list(x = 1, y = 1),
            type = c("g", "p", "smooth"),
            labels = rep("", 2))
In [242]:
notFingerprints
  1. 209
  2. 210
  3. 211
  4. 212
  5. 213
  6. 214
  7. 215
  8. 216
  9. 217
  10. 218
  11. 219
  12. 220
  13. 221
  14. 222
  15. 223
  16. 224
  17. 225
  18. 226
  19. 227
  20. 228
In [243]:
# (C) Kuhn and Johnson (2013)
### We used the full namespace to call this function because the pls
### package (also used in this chapter) has a function with the same
### name.
plotsize(6,6)
corrplot::corrplot(cor(solTrainXtrans[, notFingerprints]), 
                   order = "hclust", 
                   tl.cex = .8)

Ordinary least squares

Linear Regression employs the least squares solution $\mathbf{b} = (X'X)^{-1} X'y$. Notice that formula includes $(X'X)^{-1}$, proportional to the covariance matrix of $\mathbf{X}$. This particular calculation is sensitive to multicollinearity, inducing variability of the estimates in $\mathbf{b}$. We can use the Variance Inflation Factor (VIF) to monitor potential problems with the matrix we use to calculate the linear regression estimate $\mathbf{b}$.

Note: with limited data relative to the number of predictors $p$ (which includes all data sets today), influential observations greatly affect $\mathbf{b}$. We can try a robust regression technique to see if we get different results (e.g., Huber residuals that use a squared penalty up to a maxium penalty value).

In [311]:
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 6.2 Linear Regression

### Create a control function that will be used across models. We
### create the fold assignments explicitly instead of relying on the
### random number seed being set to identical values.

set.seed(100)
indx <- createFolds(solTrainY, returnTrain = TRUE, k=4)  # for speed purposes, I use 4-fold CV
str(indx)
ctrl <- trainControl(method = "cv", index = indx)
ctrl$number = length(ctrl$index)  # for some reason, this isn't set correctly in the function trainControl
str(ctrl)
List of 4
 $ Fold1: int [1:714] 1 2 3 4 6 7 8 12 14 15 ...
 $ Fold2: int [1:712] 1 2 3 4 5 7 8 9 10 11 ...
 $ Fold3: int [1:713] 1 3 5 6 7 9 10 11 13 14 ...
 $ Fold4: int [1:714] 2 4 5 6 8 9 10 11 12 13 ...
List of 27
 $ method           : chr "cv"
 $ number           : int 4
 $ repeats          : num 1
 $ search           : chr "grid"
 $ p                : num 0.75
 $ initialWindow    : NULL
 $ horizon          : num 1
 $ fixedWindow      : logi TRUE
 $ skip             : num 0
 $ verboseIter      : logi FALSE
 $ returnData       : logi TRUE
 $ returnResamp     : chr "final"
 $ savePredictions  : logi FALSE
 $ classProbs       : logi FALSE
 $ summaryFunction  :function (data, lev = NULL, model = NULL)  
 $ selectionFunction: chr "best"
 $ preProcOptions   :List of 6
  ..$ thresh   : num 0.95
  ..$ ICAcomp  : num 3
  ..$ k        : num 5
  ..$ freqCut  : num 19
  ..$ uniqueCut: num 10
  ..$ cutoff   : num 0.9
 $ sampling         : NULL
 $ index            :List of 4
  ..$ Fold1: int [1:714] 1 2 3 4 6 7 8 12 14 15 ...
  ..$ Fold2: int [1:712] 1 2 3 4 5 7 8 9 10 11 ...
  ..$ Fold3: int [1:713] 1 3 5 6 7 9 10 11 13 14 ...
  ..$ Fold4: int [1:714] 2 4 5 6 8 9 10 11 12 13 ...
 $ indexOut         : NULL
 $ indexFinal       : NULL
 $ timingSamps      : num 0
 $ predictionBounds : logi [1:2] FALSE FALSE
 $ seeds            : logi NA
 $ adaptive         :List of 4
  ..$ min     : num 5
  ..$ alpha   : num 0.05
  ..$ method  : chr "gls"
  ..$ complete: logi TRUE
 $ trim             : logi FALSE
 $ allowParallel    : logi TRUE
In [300]:
set.seed(100)
indx20 <- createFolds(solTrainY, returnTrain = TRUE, k=20)
ctrl20 <- trainControl(method = "cv", index = indx20)

Number of folds

We're using 10 folds in CV (ideally, still too long for some models, so we use 4-fold CV). Some methods take longer to train. You could reduce the nubmer of folds to the 3 to 5 range and perhaps not suffer too much degradation of CV tuning parameter estimates.

In [302]:
# (C) Kuhn and Johnson, 2013
### Linear regression model with all of the predictors. This will
### produce some warnings that a 'rank-deficient fit may be
### misleading'. This is related to the predictors being so highly
### correlated that some of the math has broken down.

set.seed(100)
lmTune0 <- train(x = solTrainXtrans, y = solTrainY,
                 method = "lm",
                 trControl = ctrl)
lmTune0
Warning message in predict.lm(modelFit, newdata):
“prediction from a rank-deficient fit may be misleading”
Linear Regression 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results:

  RMSE       Rsquared 
  0.7632294  0.8649296

Tuning parameter 'intercept' was held constant at a value of TRUE
In [304]:
# the danger of 4-fold CV:
lmTune0$resample
RMSERsquaredResample
0.81324590.8408778Fold1
0.73796320.8808092Fold2
0.77007110.8627116Fold3
0.73163750.8753199Fold4
In [247]:
# what does (X'X)^-1 look like?
plotx = as.matrix(solTrainX)
ev = eigen(t(plotx) %*% plotx)
str(ev)
plotsize(7,4)
plot(log10(ev$values))
ev$values[220:228]
List of 2
 $ values : num [1:228] 53433564 1867376 194393 76618 51322 ...
 $ vectors: num [1:228, 1:228] -0.00215 -0.00228 -0.00164 -0.00251 -0.00245 ...
Warning message in plot(log10(ev$values)):
“NaNs produced”
  1. 0.601588679795535
  2. 0.576384020973424
  3. 0.549631833191047
  4. 0.329219162741621
  5. 0.311144465455357
  6. 0.248932631321701
  7. 0.165489770553851
  8. 3.3909438646295e-11
  9. -3.55814945672088e-12

There are two problems causing the tiny eigenvalues: (1) NumAtoms/NumBonds, (2) NonHAtoms/Bonds

In [248]:
cor(solTrainX$NumAtoms, solTrainX$NumBonds)
cor(solTrainX$NumNonHAtoms, solTrainX$NumNonHBonds)
0.997228307027085
0.99448365161198
In [249]:
col.remove = (colnames(solTrainX) %in% c('NumBonds', 'NumNonHBonds'))
which(col.remove)
  1. 212
  2. 213
In [250]:
# (C) Kuhn and Johnson, 2013
### And another using a set of predictors reduced by unsupervised
### filtering. We apply a filter to reduce extreme between-predictor
### correlations. Note the lack of warnings.

tooHigh <- findCorrelation(cor(solTrainXtrans), .95)
tooHigh
  1. 14
  2. 33
  3. 141
  4. 166
  5. 167
  6. 175
  7. 183
  8. 208
  9. 212
  10. 213
  11. 217
  12. 220
  13. 223
  14. 224
  15. 228
  16. 19
  17. 1
  18. 25
  19. 42
  20. 58
  21. 134

Now running regression on solTrainXtrans, we don't have multicollinearity warnings:

In [251]:
trainXfiltered <- solTrainXtrans[, -tooHigh]
testXfiltered  <-  solTestXtrans[, -tooHigh]

set.seed(100)
lmTune <- train(x = trainXfiltered, y = solTrainY,
                method = "lm",
                trControl = ctrl)

lmTune

### Save the test set results in a data frame                 
testResults <- data.frame(obs = solTestY,
                          Linear_Regression = predict(lmTune, testXfiltered))
Linear Regression 

951 samples
207 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results:

  RMSE      Rsquared 
  0.767721  0.8625358

Tuning parameter 'intercept' was held constant at a value of TRUE
In [252]:
summary(lmTune)

# The resampling Rsquared was 0.87 while the full model 
#     training Rsquared is much higher at 0.94
Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.88083 -0.30042  0.00874  0.31417  1.64368 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        9.531279   1.246753   7.645 6.47e-14 ***
FP002              0.402416   0.262152   1.535 0.125198    
FP003             -0.145813   0.134925  -1.081 0.280185    
FP004             -0.195805   0.136393  -1.436 0.151539    
FP005             -0.707129   0.355772  -1.988 0.047224 *  
FP006             -0.308994   0.203822  -1.516 0.129946    
FP007              0.037501   0.118122   0.317 0.750969    
FP008              0.032480   0.161738   0.201 0.840895    
FP009             -0.027495   0.327329  -0.084 0.933080    
FP010              0.553851   0.296433   1.868 0.062102 .  
FP011              0.446775   0.208692   2.141 0.032612 *  
FP012             -0.156092   0.160851  -0.970 0.332157    
FP013             -0.179640   0.366939  -0.490 0.624588    
FP015             -0.205403   0.149741  -1.372 0.170566    
FP016             -0.148655   0.157311  -0.945 0.344979    
FP017             -0.168261   0.143145  -1.175 0.240190    
FP018             -0.283813   0.234909  -1.208 0.227362    
FP020              0.312769   0.215570   1.451 0.147232    
FP021             -0.196191   0.267782  -0.733 0.464003    
FP022              0.086329   0.338917   0.255 0.799009    
FP023             -0.414406   0.190752  -2.172 0.030134 *  
FP024             -0.383314   0.222848  -1.720 0.085837 .  
FP026              0.206458   0.258409   0.799 0.424569    
FP027              0.192492   0.243084   0.792 0.428687    
FP028              0.128069   0.124948   1.025 0.305705    
FP029             -0.150338   0.218889  -0.687 0.492408    
FP030             -0.373949   0.247873  -1.509 0.131819    
FP031              0.072936   0.135701   0.537 0.591099    
FP032             -0.416977   0.330596  -1.261 0.207601    
FP034             -0.146796   0.263866  -0.556 0.578156    
FP035             -0.245589   0.173697  -1.414 0.157812    
FP036             -0.010168   0.164793  -0.062 0.950818    
FP037              0.142717   0.166778   0.856 0.392423    
FP038              0.225418   0.195589   1.153 0.249482    
FP039             -0.264523   0.288390  -0.917 0.359315    
FP040              0.505803   0.176207   2.871 0.004215 ** 
FP041             -0.274983   0.297194  -0.925 0.355130    
FP043             -0.010806   0.211216  -0.051 0.959213    
FP044             -0.366719   0.221095  -1.659 0.097608 .  
FP045              0.241744   0.132804   1.820 0.069115 .  
FP046             -0.019929   0.177428  -0.112 0.910600    
FP047             -0.029274   0.141640  -0.207 0.836318    
FP048              0.193098   0.188190   1.026 0.305189    
FP049              0.289831   0.194319   1.492 0.136249    
FP050             -0.087140   0.146168  -0.596 0.551247    
FP051              0.052257   0.203722   0.257 0.797628    
FP052             -0.346861   0.257070  -1.349 0.177656    
FP053              0.215510   0.257119   0.838 0.402202    
FP054             -0.127939   0.188575  -0.678 0.497698    
FP055             -0.294173   0.189391  -1.553 0.120787    
FP056             -0.183255   0.299868  -0.611 0.541309    
FP057             -0.119818   0.142598  -0.840 0.401039    
FP059             -0.204067   0.174985  -1.166 0.243908    
FP060              0.263283   0.151011   1.743 0.081666 .  
FP061             -0.491600   0.137319  -3.580 0.000366 ***
FP062             -0.540208   0.276957  -1.951 0.051490 .  
FP063              0.334909   0.235635   1.421 0.155649    
FP064              0.265869   0.118682   2.240 0.025374 *  
FP065             -0.107469   0.120510  -0.892 0.372799    
FP066              0.224140   0.129137   1.736 0.083035 .  
FP067             -0.151236   0.157446  -0.961 0.337087    
FP068              0.424918   0.203797   2.085 0.037409 *  
FP069              0.133692   0.088166   1.516 0.129850    
FP070              0.011380   0.082597   0.138 0.890453    
FP071              0.218002   0.123824   1.761 0.078720 .  
FP072             -0.668332   0.261775  -2.553 0.010876 *  
FP073             -0.590945   0.209965  -2.814 0.005015 ** 
FP074              0.076014   0.118924   0.639 0.522902    
FP075              0.106091   0.104852   1.012 0.311956    
FP076              0.726063   0.170738   4.252 2.38e-05 ***
FP077              0.183112   0.120290   1.522 0.128372    
FP078             -0.390228   0.162672  -2.399 0.016692 *  
FP079              0.625276   0.187041   3.343 0.000870 ***
FP080              0.297643   0.156214   1.905 0.057120 .  
FP081             -0.293340   0.108986  -2.692 0.007272 ** 
FP082              0.273170   0.097349   2.806 0.005146 ** 
FP083             -0.341406   0.206602  -1.652 0.098859 .  
FP084              0.342894   0.240277   1.427 0.153977    
FP085             -0.403105   0.143853  -2.802 0.005208 ** 
FP086             -0.029603   0.096493  -0.307 0.759092    
FP087              0.692471   0.263050   2.632 0.008653 ** 
FP088              0.223237   0.103385   2.159 0.031148 *  
FP089              0.006225   0.175138   0.036 0.971655    
FP090             -0.072323   0.119231  -0.607 0.544318    
FP091              0.273407   0.150673   1.815 0.069993 .  
FP092              0.234776   0.207229   1.133 0.257609    
FP093              0.197212   0.111417   1.770 0.077131 .  
FP094             -0.164714   0.143986  -1.144 0.253010    
FP095             -0.059491   0.114094  -0.521 0.602231    
FP096             -0.409433   0.146340  -2.798 0.005278 ** 
FP097              0.064478   0.161998   0.398 0.690731    
FP098             -0.124774   0.164789  -0.757 0.449188    
FP099              0.117215   0.260121   0.451 0.652395    
FP100             -0.541338   0.311013  -1.741 0.082174 .  
FP101             -0.195071   0.303582  -0.643 0.520705    
FP102              0.006847   0.340132   0.020 0.983945    
FP103             -0.143536   0.093908  -1.528 0.126820    
FP104             -0.172334   0.119963  -1.437 0.151265    
FP105             -0.116879   0.145621  -0.803 0.422447    
FP106              0.180717   0.126959   1.423 0.155032    
FP107             -0.251706   0.173723  -1.449 0.147788    
FP108             -0.044073   0.184718  -0.239 0.811484    
FP109              0.663429   0.221248   2.999 0.002803 ** 
FP110              0.504421   0.332622   1.516 0.129819    
FP111             -0.443722   0.144305  -3.075 0.002183 ** 
FP112             -0.085908   0.268365  -0.320 0.748969    
FP113              0.167217   0.098757   1.693 0.090834 .  
FP114             -0.088856   0.178537  -0.498 0.618851    
FP115             -0.211622   0.140179  -1.510 0.131556    
FP116              0.120799   0.187698   0.644 0.520046    
FP117              0.305764   0.178481   1.713 0.087103 .  
FP118             -0.252360   0.126935  -1.988 0.047166 *  
FP119              0.780114   0.270026   2.889 0.003977 ** 
FP120             -0.255507   0.183475  -1.393 0.164159    
FP121              0.148201   0.405456   0.366 0.714829    
FP122              0.145447   0.107085   1.358 0.174800    
FP123              0.215776   0.178876   1.206 0.228091    
FP124              0.035698   0.172193   0.207 0.835823    
FP125              0.058790   0.117917   0.499 0.618231    
FP126             -0.162938   0.120282  -1.355 0.175948    
FP127             -0.725249   0.171165  -4.237 2.55e-05 ***
FP128             -0.433463   0.195376  -2.219 0.026815 *  
FP129              0.076065   0.228853   0.332 0.739699    
FP130             -1.033193   0.379176  -2.725 0.006585 ** 
FP131              0.292703   0.148601   1.970 0.049241 *  
FP132             -0.448668   0.229532  -1.955 0.050992 .  
FP133             -0.235182   0.124598  -1.888 0.059479 .  
FP135              0.130930   0.135055   0.969 0.332632    
FP136             -0.225833   0.296721  -0.761 0.446842    
FP137             -0.065913   0.298064  -0.221 0.825047    
FP138             -0.028185   0.190014  -0.148 0.882120    
FP139              0.093711   0.400070   0.234 0.814866    
FP140              0.402646   0.221450   1.818 0.069432 .  
FP142              0.572553   0.149489   3.830 0.000139 ***
FP143              0.751869   0.285099   2.637 0.008533 ** 
FP144              0.229748   0.280649   0.819 0.413259    
FP145             -0.031904   0.121025  -0.264 0.792152    
FP146             -0.174765   0.205853  -0.849 0.396165    
FP147              0.204023   0.126350   1.615 0.106790    
FP148             -0.241346   0.133393  -1.809 0.070809 .  
FP149              0.086559   0.167949   0.515 0.606433    
FP150              0.128717   0.144836   0.889 0.374447    
FP151              0.252839   0.314513   0.804 0.421708    
FP152              0.163206   0.269933   0.605 0.545620    
FP153             -0.299763   0.288298  -1.040 0.298787    
FP154             -0.825790   0.200460  -4.119 4.22e-05 ***
FP155              0.471323   0.293419   1.606 0.108629    
FP156             -0.231798   0.345681  -0.671 0.502714    
FP157             -0.490173   0.250299  -1.958 0.050563 .  
FP158              0.012727   0.195001   0.065 0.947980    
FP159              0.434299   0.243198   1.786 0.074542 .  
FP160             -0.241652   0.215711  -1.120 0.262966    
FP161             -0.229587   0.148320  -1.548 0.122067    
FP162             -0.014359   0.183827  -0.078 0.937759    
FP163              0.584452   0.261217   2.237 0.025555 *  
FP164              0.607709   0.192168   3.162 0.001629 ** 
FP165              0.121716   0.145125   0.839 0.401908    
FP168              0.043423   0.182727   0.238 0.812225    
FP169             -0.214017   0.085048  -2.516 0.012065 *  
FP170              0.025670   0.156377   0.164 0.869654    
FP171              0.423581   0.120090   3.527 0.000446 ***
FP172             -0.565437   0.250312  -2.259 0.024177 *  
FP173              0.424650   0.169552   2.505 0.012474 *  
FP174              0.068558   0.194573   0.352 0.724674    
FP176              0.797610   0.253882   3.142 0.001747 ** 
FP177              0.276720   0.242011   1.143 0.253234    
FP178              0.110112   0.213152   0.517 0.605597    
FP179             -0.086228   0.225258  -0.383 0.701980    
FP180             -0.516967   0.308929  -1.673 0.094665 .  
FP181              0.548444   0.331649   1.654 0.098613 .  
FP182             -0.110941   0.136470  -0.813 0.416518    
FP184              0.329269   0.153738   2.142 0.032538 *  
FP185             -0.164309   0.203864  -0.806 0.420516    
FP186             -0.470207   0.196039  -2.399 0.016706 *  
FP187              0.374961   0.273906   1.369 0.171432    
FP188              0.162380   0.130808   1.241 0.214865    
FP189              0.085114   0.353229   0.241 0.809653    
FP190              0.380175   0.313598   1.212 0.225783    
FP191              0.267262   0.146357   1.826 0.068238 .  
FP192             -0.094810   0.275663  -0.344 0.730992    
FP193              0.137365   0.150227   0.914 0.360813    
FP194              0.307570   0.276417   1.113 0.266197    
FP195              0.256627   0.346187   0.741 0.458747    
FP196             -0.245344   0.257562  -0.953 0.341121    
FP197             -0.045079   0.159555  -0.283 0.777618    
FP198              0.290232   0.271008   1.071 0.284546    
FP199             -0.128508   0.184695  -0.696 0.486781    
FP200              0.115328   0.216142   0.534 0.593795    
FP201             -0.361655   0.192668  -1.877 0.060897 .  
FP202              0.480010   0.180908   2.653 0.008140 ** 
FP203              0.113272   0.124181   0.912 0.361983    
FP204             -0.150433   0.263365  -0.571 0.568039    
FP205              0.019767   0.154015   0.128 0.897911    
FP206              0.027126   0.193149   0.140 0.888351    
FP207             -0.061951   0.213985  -0.290 0.772272    
MolWeight         -1.422025   0.231033  -6.155 1.23e-09 ***
NumAtoms          -1.701055   0.792455  -2.147 0.032151 *  
NumNonHAtoms      -0.233660   0.756975  -0.309 0.757655    
NumMultBonds      -0.137490   0.118875  -1.157 0.247810    
NumRotBonds       -0.460000   0.133451  -3.447 0.000599 ***
NumDblBonds       -0.166292   0.291971  -0.570 0.569155    
NumHydrogen        0.497291   0.157906   3.149 0.001702 ** 
NumCarbon         -0.774745   0.335955  -2.306 0.021380 *  
NumOxygen          1.493861   0.401794   3.718 0.000216 ***
NumSulfer         -0.627880   0.485229  -1.294 0.196072    
NumRings          -1.887635   0.441071  -4.280 2.12e-05 ***
HydrophilicFactor -0.145171   0.106919  -1.358 0.174950    
SurfaceArea1       0.278697   0.039748   7.012 5.30e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5806 on 743 degrees of freedom
Multiple R-squared:  0.9371,	Adjusted R-squared:  0.9195 
F-statistic: 53.43 on 207 and 743 DF,  p-value: < 2.2e-16

Stepwise regression

Stepwise regression adds and removes variables based on maximizing likelihood relative to a coefficient estimation penalty. The default uses AIC. To speed up the computation, I select only the highest 1/3 of $|t|$ scores in the regression (using the following code block):

In [253]:
trainingData <- solTrainXtrans
trainingData$Solubility <- solTrainY
lmFull <- lm(Solubility ~ ., data=trainingData)
out <- summary(lmFull)
keyvars <- order(out$coefficients[,3])
keyvars <- c(setdiff(keyvars, 1))   # remove intercept (if present)
trainingData <- trainingData[,keyvars[1:80]-1]
trainingData$Solubility <- solTrainY
In [254]:
lmFull <- lm(Solubility ~ ., data=trainingData)
lmReduced <- step(lmFull, trace=0, k = log(nrow(trainingData)))

An aside: eyeing signficance or randomness in added-variable plots

In the following stepwise regression, FP154 doesn't look like a real effect (just fitting noise) while a variable like FP044 does look significant.

In [255]:
plotsize(8,6)
par(las=1)
avPlots(lmReduced)
In [256]:
Xmat = model.matrix(lmReduced)
dim(Xmat)
head(Xmat)
colnames(Xmat)
  1. 951
  2. 27
(Intercept)NumNonHBondsMolWeightFP154FP061FP111NumAromaticBondsNumChlorineFP072FP096FP063FP180FP041FP094FP133FP148FP056FP059FP091FP019
6611 4.009916 5.343673 0 0 0 2.833213 0.00000000 0 1 0 0 0 0 0 1 0 1 1
6621 4.871752 5.904108 1 0 0 2.564949 0.00000001 0 1 0 0 0 0 0 0 0 1 0
6631 3.705506 5.334215 0 1 0 1.945910 0.00000001 0 0 1 0 0 1 0 0 0 0 0
6651 3.076971 4.921877 0 0 0 0.000000 0.00000000 0 0 0 1 0 0 0 0 0 0 0
6681 3.705506 5.441335 0 0 1 1.945910 0.37500000 0 1 0 0 1 0 0 0 0 1 1
6691 3.593860 5.603041 0 0 1 0.000000 0.44444441 0 1 0 0 0 1 0 0 0 0 0
  1. '(Intercept)'
  2. 'NumNonHBonds'
  3. 'MolWeight'
  4. 'FP154'
  5. 'FP061'
  6. 'FP111'
  7. 'NumAromaticBonds'
  8. 'NumChlorine'
  9. 'FP072'
  10. 'FP096'
  11. 'FP083'
  12. 'FP081'
  13. 'FP128'
  14. 'FP126'
  15. 'FP073'
  16. 'FP044'
  17. 'FP023'
  18. 'FP063'
  19. 'FP180'
  20. 'FP041'
  21. 'FP094'
  22. 'FP133'
  23. 'FP148'
  24. 'FP056'
  25. 'FP059'
  26. 'FP091'
  27. 'FP019'
In [257]:
Xmat = as.data.frame(Xmat)
Xmat$Solubility = trainingData$Solubility
varFP154 = lm(FP154 ~ . - Solubility, data=Xmat)
varY = lm(Solubility ~ . - FP154, data=Xmat)
KWdata = data.frame(Q = varY$resid, C = (varFP154$resid > 0.4))
kruskal.test(Q ~ C, data=KWdata)
plotsize(4,4)
plot(varFP154$resid, varY$resid)
	Kruskal-Wallis rank sum test

data:  Q by C
Kruskal-Wallis chi-squared = 6.1013, df = 1, p-value = 0.01351
In [264]:
set.seed(0)
pvals = rep(NA,200)
varY = lm(Solubility ~ . - FP154, data=Xmat)
KWdata = data.frame(Q = varY$resid, C = (varFP154$resid > 0.4))
for (i in 1:200){ 
    out = kruskal.test(Q ~ C, data=KWdata[sample(nrow(KWdata), nrow(KWdata), replace=TRUE),])
    pvals[i] = out$p.value
}
plotsize(6,4)
hist(pvals,25)
In [261]:
varFP044 = lm(FP023 ~ . - Solubility, data=Xmat)
varY = lm(Solubility ~ . - FP023, data=Xmat)
KWdata = data.frame(Q = varY$resid, C = (varFP044$resid > 0.5))
set.seed(0)
pvals = rep(NA,200)
for (i in 1:200){ 
    out = kruskal.test(Q ~ C, data=KWdata[sample(nrow(KWdata), nrow(KWdata), replace=TRUE),])
    pvals[i] = out$p.value
}
hist(pvals, 25)
In [265]:
summary(lmReduced)
Call:
lm(formula = Solubility ~ NumNonHBonds + MolWeight + FP154 + 
    FP061 + FP111 + NumAromaticBonds + NumChlorine + FP072 + 
    FP096 + FP083 + FP081 + FP128 + FP126 + FP073 + FP044 + FP023 + 
    FP063 + FP180 + FP041 + FP094 + FP133 + FP148 + FP056 + FP059 + 
    FP091 + FP019, data = trainingData)

Residuals:
    Min      1Q  Median      3Q     Max 
-3.2284 -0.4835  0.0179  0.5329  3.0826 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)       7.83880    0.54188  14.466  < 2e-16 ***
NumNonHBonds     -1.04063    0.09605 -10.835  < 2e-16 ***
MolWeight        -1.49868    0.14842 -10.097  < 2e-16 ***
FP154            -0.47647    0.14910  -3.196 0.001442 ** 
FP061             0.45018    0.08935   5.038 5.65e-07 ***
FP111            -0.77153    0.09459  -8.156 1.12e-15 ***
NumAromaticBonds -0.28286    0.04251  -6.654 4.87e-11 ***
NumChlorine      -1.13708    0.19149  -5.938 4.08e-09 ***
FP072             1.16997    0.09401  12.445  < 2e-16 ***
FP096             0.80361    0.08871   9.058  < 2e-16 ***
FP083            -0.43590    0.09511  -4.583 5.21e-06 ***
FP081            -0.37210    0.08417  -4.421 1.10e-05 ***
FP128            -0.56503    0.09837  -5.744 1.26e-08 ***
FP126            -0.41813    0.10494  -3.985 7.29e-05 ***
FP073             0.42217    0.08668   4.870 1.31e-06 ***
FP044            -1.00712    0.13468  -7.478 1.76e-13 ***
FP023            -0.62348    0.10272  -6.070 1.87e-09 ***
FP063             1.58388    0.09844  16.090  < 2e-16 ***
FP180            -0.65568    0.11162  -5.874 5.93e-09 ***
FP041             0.49026    0.14607   3.356 0.000822 ***
FP094            -0.43362    0.07753  -5.593 2.94e-08 ***
FP133            -0.26155    0.08436  -3.100 0.001991 ** 
FP148             0.32749    0.11078   2.956 0.003194 ** 
FP056            -0.72098    0.12441  -5.795 9.35e-09 ***
FP059            -1.09582    0.12823  -8.546  < 2e-16 ***
FP091             0.63030    0.09964   6.325 3.93e-10 ***
FP019             0.27552    0.10057   2.740 0.006271 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.8044 on 924 degrees of freedom
Multiple R-squared:  0.8497,	Adjusted R-squared:  0.8455 
F-statistic:   201 on 26 and 924 DF,  p-value: < 2.2e-16

A nice result from stepwise regression is the above summary agrees with our assessment from the added-variable plots. Note that fingerprints 19, 133, 148, and 154 don't look significant and have relative large p-values (for t-test of coefficient significance). Compare with the above lmFull model summary with all coefficient with p-values > 0.01 removed:

Call:
lm(formula = .outcome ~ ., data = dat)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.88083 -0.30042  0.00874  0.31417  1.64368 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
(Intercept)        9.531279   1.246753   7.645 6.47e-14 ***
FP040              0.505803   0.176207   2.871 0.004215 ** 
FP061             -0.491600   0.137319  -3.580 0.000366 ***
FP073             -0.590945   0.209965  -2.814 0.005015 ** 
FP076              0.726063   0.170738   4.252 2.38e-05 ***
FP079              0.625276   0.187041   3.343 0.000870 ***
FP081             -0.293340   0.108986  -2.692 0.007272 ** 
FP082              0.273170   0.097349   2.806 0.005146 ** 
FP085             -0.403105   0.143853  -2.802 0.005208 ** 
FP087              0.692471   0.263050   2.632 0.008653 ** 
FP096             -0.409433   0.146340  -2.798 0.005278 ** 
FP109              0.663429   0.221248   2.999 0.002803 ** 
FP111             -0.443722   0.144305  -3.075 0.002183 ** 
FP119              0.780114   0.270026   2.889 0.003977 ** 
FP127             -0.725249   0.171165  -4.237 2.55e-05 ***
FP130             -1.033193   0.379176  -2.725 0.006585 ** 
FP142              0.572553   0.149489   3.830 0.000139 ***
FP143              0.751869   0.285099   2.637 0.008533 ** 
FP154             -0.825790   0.200460  -4.119 4.22e-05 ***
FP164              0.607709   0.192168   3.162 0.001629 ** 
FP171              0.423581   0.120090   3.527 0.000446 ***
FP176              0.797610   0.253882   3.142 0.001747 ** 
FP202              0.480010   0.180908   2.653 0.008140 ** 
MolWeight         -1.422025   0.231033  -6.155 1.23e-09 ***
NumRotBonds       -0.460000   0.133451  -3.447 0.000599 ***
NumHydrogen        0.497291   0.157906   3.149 0.001702 ** 
NumOxygen          1.493861   0.401794   3.718 0.000216 ***
NumRings          -1.887635   0.441071  -4.280 2.12e-05 ***
SurfaceArea1       0.278697   0.039748   7.012 5.30e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.5806 on 743 degrees of freedom
Multiple R-squared:  0.9371,    Adjusted R-squared:  0.9195 
F-statistic: 53.43 on 207 and 743 DF,  p-value: < 2.2e-16

Another aside: Variable screening

In [268]:
# Need to find the top 1/3 of variables for stepwise regression
# Perform stepwise on smaller datasets and keep track of those in final model
coltimes = rep(0, ncol(solTrainXtrans))
for (k in 1:20){
    #cat("starting CV division",k,"...\n")
    set.seed(k)
    colindex = sample(rep(1:10, length=ncol(solTrainXtrans)))
    for (j in 1:max(colindex)){
        trdat = solTrainXtrans[,colindex == j]
        trdat$Y = solTrainY
        out.full = lm(Y ~ ., data=trdat)
        out.step = step(out.full, trace=0, k=log(nrow(trdat)))
        inmodel = (colnames(solTrainXtrans) %in% colnames(out.step$model))
        coltimes[inmodel] = coltimes[inmodel] + 1
        }
    }
# [less than 1 minute to run]
In [271]:
plotsize(6,5)
boxplot(cor(solTrainXtrans, solTrainY) ~ coltimes)
abline(h=0, col='red')
sum(coltimes >= 10)
125
In [332]:
trainingData <- solTrainXtrans[, coltimes >= 10]
testingData <- solTestXtrans[, coltimes >= 10]
trainingData$Solubility <- solTrainY
lmFull <- lm(Solubility ~ ., data=trainingData)
out <- summary(lmFull)
In [273]:
starttime <- proc.time()[3]
lmReducedVS <- step(lmFull, trace=0, k = log(nrow(trainingData)))
proc.time()[3] - starttime
elapsed: 55.1419999999998
In [274]:
summary(lmReducedVS)
Call:
lm(formula = Solubility ~ FP002 + FP004 + FP005 + FP039 + FP040 + 
    FP053 + FP059 + FP061 + FP065 + FP068 + FP072 + FP074 + FP075 + 
    FP083 + FP085 + FP088 + FP089 + FP096 + FP099 + FP101 + FP111 + 
    FP124 + FP129 + FP135 + FP142 + FP152 + FP168 + FP172 + FP193 + 
    FP202 + FP204 + MolWeight + NumAtoms + NumNonHAtoms + NumBonds + 
    NumNonHBonds + NumRotBonds + NumAromaticBonds + NumHydrogen + 
    NumOxygen + NumRings + SurfaceArea1, data = trainingData)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.3331 -0.3543  0.0373  0.3785  2.0109 

Coefficients:
                  Estimate Std. Error t value Pr(>|t|)    
(Intercept)        1.14151    1.44437   0.790 0.429550    
FP002              0.57006    0.13575   4.199 2.94e-05 ***
FP004             -0.25297    0.07734  -3.271 0.001113 ** 
FP005              3.08177    0.50466   6.107 1.51e-09 ***
FP039             -0.38219    0.11172  -3.421 0.000652 ***
FP040              0.47755    0.10452   4.569 5.58e-06 ***
FP053              0.35060    0.09248   3.791 0.000160 ***
FP059             -0.49696    0.11277  -4.407 1.17e-05 ***
FP061             -0.67814    0.10817  -6.269 5.61e-10 ***
FP065             -0.21733    0.08232  -2.640 0.008432 ** 
FP068              0.26843    0.09699   2.768 0.005762 ** 
FP072             -0.92963    0.16554  -5.616 2.60e-08 ***
FP074              0.24742    0.06542   3.782 0.000166 ***
FP075              0.22989    0.07660   3.001 0.002764 ** 
FP083             -0.37852    0.08990  -4.210 2.80e-05 ***
FP085             -0.27603    0.06958  -3.967 7.85e-05 ***
FP088              0.22568    0.06234   3.620 0.000311 ***
FP089              0.74972    0.16888   4.439 1.01e-05 ***
FP096             -0.50520    0.11524  -4.384 1.30e-05 ***
FP099              0.61385    0.12436   4.936 9.49e-07 ***
FP101              0.39549    0.07756   5.099 4.16e-07 ***
FP111             -0.39298    0.09251  -4.248 2.38e-05 ***
FP124              0.31601    0.06654   4.749 2.37e-06 ***
FP129             -0.45040    0.13276  -3.393 0.000722 ***
FP135              0.40049    0.08752   4.576 5.40e-06 ***
FP142              0.68611    0.09808   6.996 5.12e-12 ***
FP152             -0.43346    0.11708  -3.702 0.000227 ***
FP168             -0.31184    0.11545  -2.701 0.007041 ** 
FP172             -0.37650    0.08723  -4.316 1.76e-05 ***
FP193             -0.26930    0.10333  -2.606 0.009303 ** 
FP202              0.35763    0.06590   5.427 7.35e-08 ***
FP204             -0.33635    0.09433  -3.565 0.000382 ***
MolWeight         -1.01387    0.15203  -6.669 4.47e-11 ***
NumAtoms         -14.31447    2.22269  -6.440 1.93e-10 ***
NumNonHAtoms      18.13247    2.43942   7.433 2.45e-13 ***
NumBonds           9.20468    1.72276   5.343 1.16e-07 ***
NumNonHBonds     -10.68536    1.31394  -8.132 1.38e-15 ***
NumRotBonds       -0.36386    0.08229  -4.422 1.10e-05 ***
NumAromaticBonds  -1.75822    0.25924  -6.782 2.13e-11 ***
NumHydrogen        1.23439    0.15596   7.915 7.20e-15 ***
NumOxygen          1.64301    0.23903   6.874 1.16e-11 ***
NumRings           1.84099    0.36957   4.981 7.56e-07 ***
SurfaceArea1       0.24456    0.01572  15.559  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.6194 on 908 degrees of freedom
Multiple R-squared:  0.9125,	Adjusted R-squared:  0.9084 
F-statistic: 225.4 on 42 and 908 DF,  p-value: < 2.2e-16
In [275]:
plotsize(8,6)
par(las=1)
avPlots(lmReducedVS)
In [276]:
testResults$step = predict(lmReduced, solTestXtrans)
testResults$stepVS = predict(lmReducedVS, solTestXtrans)
cor(testResults)
obsLinear_RegressionstepstepVS
obs1.00000000.93135970.90711960.9337202
Linear_Regression0.93135971.00000000.94799450.9752667
step0.90711960.94799451.00000000.9517860
stepVS0.93372020.97526670.95178601.0000000
In [277]:
a = testResults
for (j in 2:4)
cat(colnames(a)[j], 1 - sum((a[,1] - a[,j])^2)/sum((a[,1]-mean(a[,1]))^2),'\n')
Linear_Regression 0.8663201 
step 0.8214145 
stepVS 0.8713994 

Using a rough variable screening technique yields better results (in a reasonable amount of time). Variable screening techniques become more important as the number of predictors brings some computationally-intense methods to a crawl. Note that the stepVS achieves an $R^2$ of 0.869 while Linear_Regression achieves only 0.866. Compare this to the estimated R2 and R2-adjusted values:

method est. $R^2$ $R^2$-$adj$ actual $R^2$
Linear Regression 0.937 0.920 0.866
Step Regression 0.850 0.846 0.821
Step with VS 0.912 0.908 0.871

There is likely something fundamentally different between the training set and the testing set.

Checking model assumptions

Check for outliers, nonrandomness, other structure, or nonnormality:

  • plot of Y vs. Predicted
  • plot of Residuals vs. Predicted
  • histogram of Residuals (for normality)
In [278]:
plotsize(4,4)
xyplot(solTrainY ~ predict(lmTune), type=c("p","g"), #plot points (p) and grid (g)
      xlab="Predicted", ylab="Observed")
In [279]:
xyplot(resid(lmTune) ~ predict(lmTune), type=c("p","g"), #plot points (p) and grid (g)
      xlab="Predicted", ylab="Residuals")
plotsize(6,3)
hist(resid(lmTune))
In [281]:
# warning: don't rely on shapiro.test... everything is nonnormal with enough data
shapiro.test(resid(lmTune))
# the test suggest rejecting H0: {normal errors}  in favor of H1: {lack of normality}
	Shapiro-Wilk normality test

data:  resid(lmTune)
W = 0.99619, p-value = 0.02004

Everything is fine if truly from a normal distribution

In [198]:
plotsize(7,3)
par(mfrow=c(1,3))
for (j in 1:3){
    sampsize = 5*10^j
    pvals = rep(NA, 1000)
    for (k in 1:length(pvals)){
        set.seed(k)
        a <- rnorm(sampsize)
        pvals[k] = shapiro.test(a)$p.value
    }
    hist(pvals, main=sprintf('Non-normality test p-values\nfor size = %i',sampsize))
}

... but minor deviations that are unconcerning will raise a flag

In [282]:
plotsize(7,3)
par(mfrow=c(1,3))
for (j in 1:3){
    sampsize = 5*10^j
    pvals = rep(NA, 1000)
    for (k in 1:length(pvals)){
        set.seed(k)
        a <- rt(sampsize,30)
        pvals[k] = shapiro.test(a)$p.value
    }
    hist(pvals, main=sprintf('Non-normality test p-values\nfor size = %i',sampsize))
}
In [211]:
plotsize(8,5)
par(mfrow=c(1,2),las=1)
plot(a <- -400:400/100, pnorm(a), type='l',lwd=3,col='red',xlab='x',ylab='density',
    main='cumulative distribution function')
lines(a, pt(a,30), lwd=1.5)
plot(a <- -400:400/100, pnorm(a), type='l',lwd=3,col='red',log='y',xlab='x',
     main='cumulative distribution function')
lines(a, pt(a,30), lwd=1.5)

seen another way, standard normals rounded to the nearest 0.1 lead to a nonnormal test result for large samples

In [220]:
plotsize(7,3)
par(mfrow=c(1,3))
for (j in 1:3){
    sampsize = 5*10^j
    pvals = rep(NA, 1000)
    for (k in 1:length(pvals)){
        set.seed(k)
        a <- round(rnorm(sampsize),1)
        pvals[k] = shapiro.test(a)$p.value
    }
    hist(pvals, main=sprintf('Non-normality test p-values\nfor size = %i',sampsize), breaks=0:10/10)
}

To get a favor of how complicated regression in caret can be:

In [284]:
rlmPCA <- train(solTrainXtrans, solTrainY, method="rlm", # robust linear model
               preProcess = "pca", # using PCA preprocessing, as in Class 1
               trControl = ctrl)
rlmPCA
Robust Linear Model 

951 samples
228 predictors

Pre-processing: principal component signal extraction (228), centered
 (228), scaled (228) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  intercept  psi           RMSE       Rsquared 
  FALSE      psi.huber     2.8244801  0.8527755
  FALSE      psi.hampel    2.8242437  0.8528197
  FALSE      psi.bisquare  2.8244468  0.8527235
   TRUE      psi.huber     0.7905917  0.8526263
   TRUE      psi.hampel    0.7895953  0.8529274
   TRUE      psi.bisquare  0.7951582  0.8507417

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were intercept = TRUE and psi = psi.hampel.
In [285]:
testResults$RLM <- predict(rlmPCA, solTestXtrans)

Principal components and Partial least squares regression

Principal components regression (PCR)

Singular value decomposition is used to decompose $\mathbf{X}$ ($n \times p$ matrix) into a orthonormal matrix ($p \times p$) that best explains the row space of $\mathbf{X}$ (the $\mathbb{R}^p$ space spanned by the rows of $\mathbf{X}$)

Then we see how few of these orthonormal vectors can still yield a good prediction model for $y$. Note that how the vectors correlate with $y$ are not considered in coming up with the machinery of the model. If we decompose $\mathbf{X}'\mathbf{X}$ ($p \times p$ matrix), related to the covariance of $\mathbf{X}$, we get eigenvectors (the same orthonormal vectors mentioned above) and eigenvalues. It's the eignevectors of $\mathbf{X}'\mathbf{X}$ that are used as regressors in PCR.

Partial least squares regression (PLSR)

Contrasting with the $y$-indifference of PCR is PLSR, where regressors are determined by their ability to best correlate with $y$. Then we see how few of these $y$-correlated orthonormal vectors can still yield a good prediction model for $y$.

  • Step 1: find $\mathbf{w}$ maximizing $\left| \right|$ with $||\mathbf{w}|| = 1$ and $<\ ,\ >$ denoting dot product.
  • Step j: find a new $\mathbf{w}$ (orthogonal to all other $\mathbf{w}$ selected so far) maximizing $\left| \right|$ with $||\mathbf{w}|| = 1$ where $y^{(j-1)}$ is the variation of the $y$ not yet explained by the first $j-1$ components.
In [288]:
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 6.3 Partial Least Squares

## Run PLS and PCR on solubility data and compare results
set.seed(100)
plsTune <- train(x = solTrainXtrans, y = solTrainY,
                 method = "pls",
                 tuneGrid = expand.grid(ncomp = 1:20),
                 trControl = ctrl,
                 preProc = c("center", "scale"))
plsTune
Partial Least Squares 

951 samples
228 predictors

Pre-processing: centered (228), scaled (228) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  ncomp  RMSE       Rsquared 
   1     1.2847701  0.6060391
   2     1.0631439  0.7340300
   3     0.9438137  0.7913146
   4     0.8610820  0.8253676
   5     0.8230279  0.8404437
   6     0.7904546  0.8523686
   7     0.7595864  0.8638900
   8     0.7496363  0.8670613
   9     0.7427455  0.8693892
  10     0.7371369  0.8718304
  11     0.7355266  0.8724435
  12     0.7365313  0.8722616
  13     0.7394828  0.8712471
  14     0.7365844  0.8720858
  15     0.7385447  0.8713001
  16     0.7395241  0.8710431
  17     0.7375407  0.8715039
  18     0.7405489  0.8706390
  19     0.7394964  0.8709409
  20     0.7432605  0.8697985

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was ncomp = 11.
In [289]:
testResults$PLS <- predict(plsTune, solTestXtrans)
In [292]:
set.seed(100)
pcrTune <- train(x = solTrainXtrans, y = solTrainY,
                 method = "pcr",
                 tuneGrid = expand.grid(ncomp = 1:40),  
                 trControl = ctrl)
                 # technically, minimum at 150
pcrTune
Principal Component Analysis 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  ncomp  RMSE       Rsquared  
   1     1.9793445  0.06527653
   2     1.6361649  0.36023430
   3     1.3672407  0.55487462
   4     1.3677256  0.55454864
   5     1.3384546  0.57264083
   6     1.2174304  0.64816262
   7     1.1605416  0.68151157
   8     1.1481894  0.68775119
   9     1.0579073  0.73425216
  10     1.0177000  0.75480429
  11     0.9876299  0.77031998
  12     0.9812137  0.77330600
  13     0.9722017  0.77749723
  14     0.9538373  0.78571424
  15     0.9557040  0.78493485
  16     0.8908691  0.81392789
  17     0.8834280  0.81723062
  18     0.8786584  0.81905652
  19     0.8801428  0.81847917
  20     0.8745989  0.82072111
  21     0.8291275  0.83925721
  22     0.8176284  0.84326043
  23     0.8164309  0.84361040
  24     0.8170198  0.84353418
  25     0.8152417  0.84419521
  26     0.8110317  0.84574281
  27     0.8117539  0.84547426
  28     0.8119480  0.84550859
  29     0.8022437  0.84936826
  30     0.7993359  0.85010542
  31     0.7954732  0.85159539
  32     0.7776087  0.85750669
  33     0.7544182  0.86566942
  34     0.7526671  0.86650840
  35     0.7405891  0.87044024
  36     0.7388585  0.87104149
  37     0.7372408  0.87163256
  38     0.7281077  0.87486393
  39     0.7281933  0.87489223
  40     0.7286008  0.87477044

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was ncomp = 38.
In [294]:
plsResamples <- plsTune$results
plsResamples$Model <- "PLS"
pcrResamples <- pcrTune$results
pcrResamples$Model <- "PCR"
plsPlotData <- rbind(plsResamples, pcrResamples)
In [295]:
xyplot(RMSE ~ ncomp,
       data = plsPlotData,
       #aspect = 1,
       xlab = "# Components",
       ylab = "RMSE (Cross-Validation)",
       auto.key = list(columns = 2),
       groups = Model,
       type = c("o", "g"))

plsImp <- varImp(plsTune, scale = FALSE)
plotsize(7,5)
plot(plsImp, top = 25, scales = list(y = list(cex = .95)))
In [296]:
testResults$PCR <- predict(pcrTune, solTestXtrans)
In [297]:
plot(testResults)
In [298]:
cor(testResults)
obsLinear_RegressionstepstepVSRLMPLSPCR
obs1.00000000.93135970.90711960.93372020.92583430.93915580.9222715
Linear_Regression0.93135971.00000000.94799450.97526670.96297110.98428140.9663267
step0.90711960.94799451.00000000.95178600.95889280.95577750.9589254
stepVS0.93372020.97526670.95178601.00000000.97069790.97793010.9808655
RLM0.92583430.96297110.95889280.97069791.00000000.98657610.9838603
PLS0.93915580.98428140.95577750.97793010.98657611.00000000.9787236
PCR0.92227150.96632670.95892540.98086550.98386030.97872361.0000000

Penalized regression models

Ridge regression finds the linear fit coefficients $\mathbf{b}$ that minimize $SSE + \lambda \sum_j b_j^2$. This results in a shrinking of the OLS estimate towards zero. Note that $\lambda$ is a penalty to be tuned: if $\lambda = 0$ then there is no penalty, resulting in the OLS estimate. If $\lambda$ is very large, then the fit will only include the intercept.

LASSO regression minimizes $SSE + \lambda \sum_j \left|b_j\right|$. This results in a sparse model as coefficients will often be zeroed out to reduce the fit's penalty.

An elastic net is a combination (and is superior) to both of these penalized regression models. Tune an additional parameter $\alpha \in [0,1]$ that emphasizes LASSO ($\alpha=0$) or Ridge ($\alpha=1$).

In [ ]:
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 6.4 Penalized Models

## The text used the elasticnet to obtain a ridge regression model.
## There is now a simple ridge regression method.

# ridgeGrid <- expand.grid(lambda = seq(0, .1, length = 15))   
    # this is the original call, resulting in:
Ridge Regression 

951 samples
228 predictors

Pre-processing: centered (228), scaled (228) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 
Resampling results across tuning parameters:

  lambda       RMSE       Rsquared 
  0.000000000  0.7207131  0.8769711
  0.007142857  0.7047552  0.8818659
  0.014285714  0.6964731  0.8847911
  0.021428571  0.6925923  0.8862699
  0.028571429  0.6908607  0.8870609
  0.035714286  0.6904220  0.8874561
  0.042857143  0.6908548  0.8875998
  0.050000000  0.6919152  0.8875759
  0.057142857  0.6934719  0.8874300
  0.064285714  0.6954114  0.8872009
  0.071428571  0.6976723  0.8869096
  0.078571429  0.7002069  0.8865723
  0.085714286  0.7029801  0.8862009
  0.092857143  0.7059656  0.8858041
  0.100000000  0.7091432  0.8853885

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was lambda = 0.03571429.
In [317]:
ridgeGrid <- expand.grid(lambda = seq(.03, .05, length = 5)) 
set.seed(100)
ridgeTune <- train(x = solTrainXtrans, y = solTrainY,
                   method = "ridge",
                   tuneGrid = ridgeGrid,
                   trControl = ctrl,
                   preProc = c("center", "scale"))
ridgeTune
Ridge Regression 

951 samples
228 predictors

Pre-processing: centered (228), scaled (228) 
Resampling: Cross-Validated (4 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  lambda  RMSE       Rsquared 
  0.030   0.7146464  0.8804333
  0.035   0.7137952  0.8809102
  0.040   0.7134550  0.8812434
  0.045   0.7135146  0.8814678
  0.050   0.7138969  0.8816080

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was lambda = 0.04.

Questions:

  • why is the RMSE higher here with 4-fold CV (instead of 10-fold CV)?
  • why is a larger lambda chosen with 4-fold CV?
In [318]:
print(update(plot(ridgeTune), xlab = "Penalty"))
testResults$Ridge <- predict(ridgeTune, solTestXtrans)
In [320]:
enetGrid <- expand.grid(lambda = c(0, 0.01, .1), 
                        fraction = seq(.05, 1, length = 20))
set.seed(100)
starttime <- proc.time()[3]
enetTune <- train(x = solTrainXtrans, y = solTrainY,
                  method = "enet",
                  tuneGrid = enetGrid,
                  trControl = ctrl,
                  preProc = c("center", "scale"))
proc.time()[3] - starttime
enetTune

plot(enetTune)

testResults$Enet <- predict(enetTune, solTestXtrans)
elapsed: 22.3169999999991
Elasticnet 

951 samples
228 predictors

Pre-processing: centered (228), scaled (228) 
Resampling: Cross-Validated (4 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  lambda  fraction  RMSE       Rsquared 
  0.00    0.05      0.8558983  0.8402335
  0.00    0.10      0.6977481  0.8857478
  0.00    0.15      0.6954231  0.8855478
  0.00    0.20      0.7033247  0.8830294
  0.00    0.25      0.7146147  0.8795669
  0.00    0.30      0.7264827  0.8760171
  0.00    0.35      0.7381596  0.8723547
  0.00    0.40      0.7442228  0.8704930
  0.00    0.45      0.7468520  0.8696942
  0.00    0.50      0.7474171  0.8695483
  0.00    0.55      0.7476504  0.8695461
  0.00    0.60      0.7482738  0.8694260
  0.00    0.65      0.7489758  0.8692365
  0.00    0.70      0.7502272  0.8688819
  0.00    0.75      0.7519354  0.8683761
  0.00    0.80      0.7539205  0.8677921
  0.00    0.85      0.7557336  0.8672453
  0.00    0.90      0.7578664  0.8665910
  0.00    0.95      0.7603577  0.8658211
  0.00    1.00      0.7632294  0.8649296
  0.01    0.05      1.5000109  0.6591793
  0.01    0.10      1.1099197  0.7721361
  0.01    0.15      0.8886350  0.8304641
  0.01    0.20      0.7757848  0.8619496
  0.01    0.25      0.7260250  0.8771015
  0.01    0.30      0.7052569  0.8830188
  0.01    0.35      0.6982685  0.8849287
  0.01    0.40      0.6966915  0.8851580
  0.01    0.45      0.6963102  0.8851669
  0.01    0.50      0.6959188  0.8852447
  0.01    0.55      0.6964799  0.8850942
  0.01    0.60      0.6985363  0.8844835
  0.01    0.65      0.7013033  0.8836357
  0.01    0.70      0.7054437  0.8823511
  0.01    0.75      0.7090996  0.8812578
  0.01    0.80      0.7122910  0.8802928
  0.01    0.85      0.7160200  0.8791711
  0.01    0.90      0.7200947  0.8779476
  0.01    0.95      0.7245767  0.8765992
  0.01    1.00      0.7289229  0.8752945
  0.10    0.05      1.6828615  0.5198609
  0.10    0.10      1.3992051  0.7007165
  0.10    0.15      1.1603286  0.7627053
  0.10    0.20      1.0011273  0.7906846
  0.10    0.25      0.8914899  0.8235494
  0.10    0.30      0.8166379  0.8456134
  0.10    0.35      0.7748438  0.8584058
  0.10    0.40      0.7520446  0.8664789
  0.10    0.45      0.7385339  0.8717055
  0.10    0.50      0.7319684  0.8747713
  0.10    0.55      0.7273820  0.8770704
  0.10    0.60      0.7244510  0.8786742
  0.10    0.65      0.7227311  0.8797819
  0.10    0.70      0.7225545  0.8803292
  0.10    0.75      0.7230807  0.8805898
  0.10    0.80      0.7239593  0.8806933
  0.10    0.85      0.7247163  0.8808035
  0.10    0.90      0.7256278  0.8808656
  0.10    0.95      0.7267011  0.8808647
  0.10    1.00      0.7279391  0.8807845

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were fraction = 0.15 and lambda = 0.
In [327]:
enetGrid2 <- expand.grid(lambda = c(.01, .02), 
                        fraction = seq(.4, .6, length = 5))
enetTune2 <- train(x = solTrainXtrans, y = solTrainY,
                  method = "enet",
                  tuneGrid = enetGrid2,
                  trControl = ctrl,
                  preProc = c("center", "scale"))

enetTune2
plot(enetTune2)

testResults$Enet2 <- predict(enetTune2, solTestXtrans)
Elasticnet 

951 samples
228 predictors

Pre-processing: centered (228), scaled (228) 
Resampling: Cross-Validated (4 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  lambda  fraction  RMSE       Rsquared 
  0.01    0.40      0.6966915  0.8851580
  0.01    0.45      0.6963102  0.8851669
  0.01    0.50      0.6959188  0.8852447
  0.01    0.55      0.6964799  0.8850942
  0.01    0.60      0.6985363  0.8844835
  0.02    0.40      0.7013337  0.8838869
  0.02    0.45      0.6991341  0.8844340
  0.02    0.50      0.6978714  0.8847609
  0.02    0.55      0.6972920  0.8849490
  0.02    0.60      0.6974494  0.8849530

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were fraction = 0.5 and lambda = 0.01.
In [328]:
cor(testResults)
obsLinear_RegressionstepstepVSRLMPLSPCRRidgeEnetEnet2
obs1.00000000.93135970.90711960.93372020.92583430.93915580.92227150.93880130.94136240.9407979
Linear_Regression0.93135971.00000000.94799450.97526670.96297110.98428140.96632670.99185690.98826070.9884263
step0.90711960.94799451.00000000.95178600.95889280.95577750.95892540.95986370.96356280.9653210
stepVS0.93372020.97526670.95178601.00000000.97069790.97793010.98086550.98127120.98791770.9869279
RLM0.92583430.96297110.95889280.97069791.00000000.98657610.98386030.98347810.98344530.9852944
PLS0.93915580.98428140.95577750.97793010.98657611.00000000.97872360.99671790.99380030.9946541
PCR0.92227150.96632670.95892540.98086550.98386030.97872361.00000000.98130660.98617550.9862199
Ridge0.93880130.99185690.95986370.98127120.98347810.99671790.98130661.00000000.99630900.9973560
Enet0.94136240.98826070.96356280.98791770.98344530.99380030.98617550.99630901.00000000.9996274
Enet20.94079790.98842630.96532100.98692790.98529440.99465410.98621990.99735600.99962741.0000000
In [330]:
cat("Training Errors:\n")
fullpred <- predict(enetTune, solTrainXtrans)
cat('Enet: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(ridgeTune, solTrainXtrans)
cat('Ridge:',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(lmReducedVS, solTrainXtrans)
cat('Step: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),":\n")
fullpred <- predict(lmTune, solTrainXtrans)
cat('LM:   ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(rlmPCA, solTrainXtrans)
cat('RLM:  ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2)," : ")
fullpred <- predict(plsTune, solTrainXtrans)
cat('PLSR: ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(pcrTune, solTrainXtrans)
cat('PCR:  ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),"\n")
cat("Testing Errors:\n") 
fullpred <- predict(enetTune, solTestXtrans)
cat('Enet: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2)," : ")
fullpred <- predict(ridgeTune, solTestXtrans)
cat('Ridge:',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(lmReducedVS, solTestXtrans)
cat('Step: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),":\n")
fullpred <- predict(lmTune, solTestXtrans)
cat('LM:   ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(rlmPCA, solTestXtrans)
cat('RLM:  ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(plsTune, solTestXtrans)
cat('PLSR: ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(pcrTune, solTestXtrans)
cat('PCR:  ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),"\n")
Training Errors:
Enet:  0.9227276 : Ridge: 0.9284916 : Step:  0.9124697 :
LM:    0.9370535 : RLM:   0.880426  : PLSR:  0.9171141 : PCR:   0.8852926 
Testing Errors:
Enet:  0.885803  : Ridge: 0.8798271 : Step:  0.8713994 :
LM:    0.8663201 : RLM:   0.8561811 : PLSR:  0.880997 : PCR:   0.8497764 

Part 6. Nonlinear Regression Models

Neural Networks

The "traditional" regression methods seem to achieve a training $R^2$ of 0.88-0.89 when predicting solubility (estimated with 10-fold cross-validation). Using cross-validation gives a more accurate assessment of how the method performs on future data as those predicted values didn't include their own data in estimating model coefficients. Contrast this to the training error from the full model:

Neural networks embed a hidden structure that is additive according to some transform of variables. Each variable and hidden layer add a large number of coefficients to estimate, so tuning is very important.

  • A single hidden layer is like PLSR in that linear combination coefficients are found for each node that maximizes predictive ability of $y$, except for NNs there is a transform that requires more coefficient estimation.
  • Adding a second hidden layer allows a two-stage PLSR where the first layer becomes features of the predictors (automated feature engineering?), aided by the transforms embedded in the combination process.
In [344]:
# (C) Kuhn and Johnson, 2013
################################################################################
### Section 7.1 Neural Networks

### Optional: parallel processing can be used via the 'do' packages,
### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
### up the computations.

### WARNING: Be aware of how much memory is needed to parallel
### process. It can very quickly overwhelm the availible hardware. We
### estimate the memory usuage (VSIZE = total memory size) to be 
### 2677M/core.

registerDoMC(4)

# # this took too long to run (not complete after 1 hour)
# nnetGrid <- expand.grid(decay = c(0, 0.01, .1), 
#                         size = c(1, 3, 5, 7, 9, 11, 13), 
#                         bag = FALSE)

# advice when tuning: try an initial grid that is very broad and sparse,
#    like decay = c(0, .5, 10) and size = c(3, 9, 17)
#    and then fine-tune based on results

nnetGrid <- expand.grid(size = c(1, 4, 7, 10), decay = c(0.01, 0.1, 1, 10)) 
#                        bag = FALSE)
nnetGrid

set.seed(100)
indx2 <- createFolds(solTrainY, returnTrain = TRUE, k=2)  # for speed purposes, I use 4-fold CV
str(indx2)
ctrl2 <- trainControl(method = "cv", index = indx2)
ctrl2$number = length(ctrl2$index)  # for some reason, this isn't set correctly in the function trainControl
which(colnames(trainingData) == "Solubility")
dim(trainingData); dim(testingData)
sizedecay
1 0.01
4 0.01
7 0.01
10 0.01
1 0.10
4 0.10
7 0.10
10 0.10
1 1.00
4 1.00
7 1.00
10 1.00
1 10.00
4 10.00
7 10.00
10 10.00
List of 2
 $ Fold1: int [1:475] 1 3 4 6 8 9 11 12 14 17 ...
 $ Fold2: int [1:476] 2 5 7 10 13 15 16 18 23 24 ...
126
  1. 951
  2. 126
  1. 316
  2. 125
In [345]:
set.seed(100)
starttime <- proc.time()[3]
nnetTune <- train(x = trainingData[,-126], y = trainingData[,126],
                  method = "nnet",
                  tuneGrid = nnetGrid,
                  trControl = ctrl2,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = TRUE,
                  #MaxNWts = 13 * (ncol(solTrainXtrans) + 1) + 13 + 1,
                  MaxNWts = 7 * (ncol(testingData) + 1) + 7 + 1,
                  maxit = 250,
                  allowParallel = FALSE)
proc.time()[3] - starttime
nnetTune
Warning message in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
“There were missing values in resampled performance measures.”Warning message in train.default(x = trainingData[, -126], y = trainingData[, 126], :
“missing values found in aggregated results”
# weights:  890
initial  value 9585.425224 
iter  10 value 2232.793634
iter  20 value 1526.400000
iter  30 value 1393.920142
iter  40 value 1128.481718
iter  50 value 826.935972
iter  60 value 755.365096
iter  70 value 715.801095
iter  80 value 637.959174
iter  90 value 564.382399
iter 100 value 551.187837
iter 110 value 534.975691
iter 120 value 520.258302
iter 130 value 503.358268
iter 140 value 497.664079
iter 150 value 490.517130
iter 160 value 485.160240
iter 170 value 481.366710
iter 180 value 478.214630
iter 190 value 475.628634
iter 200 value 473.516252
iter 210 value 471.589907
iter 220 value 469.833317
iter 230 value 468.927843
iter 240 value 467.950582
iter 250 value 467.435811
final  value 467.435811 
stopped after 250 iterations
elapsed: 15.6169999999984
Neural Network 

951 samples
125 predictors

Pre-processing: centered (125), scaled (125) 
Resampling: Cross-Validated (2 fold) 
Summary of sample sizes: 475, 476 
Resampling results across tuning parameters:

  size  decay  RMSE       Rsquared 
   1     0.01  1.0330161  0.7479345
   1     0.10  0.8069592  0.8490484
   1     1.00  0.8102457  0.8459083
   1    10.00  0.9373774  0.8067731
   4     0.01  1.1842109  0.7042725
   4     0.10  0.9010325  0.8177184
   4     1.00  0.7760414  0.8620772
   4    10.00  0.7412786  0.8718713
   7     0.01  1.1049560  0.7341082
   7     0.10  1.0579855  0.7653938
   7     1.00  0.7129022  0.8832211
   7    10.00  0.7061993  0.8829121
  10     0.01        NaN        NaN
  10     0.10        NaN        NaN
  10     1.00        NaN        NaN
  10    10.00        NaN        NaN

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were size = 7 and decay = 10.
In [346]:
plot(nnetTune)
In [347]:
nnetGrid2 <- rbind(
    expand.grid(size = c(1, 3), decay = c(0.05, 0.2, 0.8)),
    expand.grid(size = c(6, 8), decay = c(0.5, 2, 8)))
    
nnetGrid2
sizedecay
1 0.05
3 0.05
1 0.20
3 0.20
1 0.80
3 0.80
6 0.50
8 0.50
6 2.00
8 2.00
6 8.00
8 8.00
In [348]:
set.seed(100)
starttime <- proc.time()[3]
nnetTune2 <- train(x = trainingData[,-126], y = trainingData[,126],
                  method = "nnet",
                  tuneGrid = nnetGrid2,
                  trControl = ctrl2,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = TRUE,
                  MaxNWts = 7 * (ncol(testingData) + 1) + 7 + 1,
                  maxit = 250,
                  allowParallel = FALSE)
proc.time()[3] - starttime
nnetTune2
Warning message in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
“There were missing values in resampled performance measures.”Warning message in train.default(x = trainingData[, -126], y = trainingData[, 126], :
“missing values found in aggregated results”
# weights:  763
initial  value 14999.279105 
iter  10 value 3768.894551
iter  20 value 3035.918276
iter  30 value 2171.404942
iter  40 value 1287.525635
iter  50 value 1113.598243
iter  60 value 988.035967
iter  70 value 780.996926
iter  80 value 700.226011
iter  90 value 622.599992
iter 100 value 564.440819
iter 110 value 532.650502
iter 120 value 508.425732
iter 130 value 495.327059
iter 140 value 480.842339
iter 150 value 472.198417
iter 160 value 467.182838
iter 170 value 462.436853
iter 180 value 458.486156
iter 190 value 454.887970
iter 200 value 451.587484
iter 210 value 448.689749
iter 220 value 445.562446
iter 230 value 442.369159
iter 240 value 439.586191
iter 250 value 437.823052
final  value 437.823052 
stopped after 250 iterations
elapsed: 8.75499999999738
Neural Network 

951 samples
125 predictors

Pre-processing: centered (125), scaled (125) 
Resampling: Cross-Validated (2 fold) 
Summary of sample sizes: 475, 476 
Resampling results across tuning parameters:

  size  decay  RMSE       Rsquared 
  1     0.05   0.8122719  0.8463901
  1     0.20   0.7836828  0.8574688
  1     0.80   0.8057688  0.8479024
  3     0.05   1.0319769  0.7716858
  3     0.20   0.9568722  0.8028368
  3     0.80   0.8034844  0.8503995
  6     0.50   0.7676766  0.8669188
  6     2.00   0.7331277  0.8746203
  6     8.00   0.7040760  0.8830132
  8     0.50         NaN        NaN
  8     2.00         NaN        NaN
  8     8.00         NaN        NaN

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were size = 6 and decay = 8.
In [349]:
plot(nnetTune2)
In [350]:
nnetGrid3 <- rbind(
    expand.grid(size = c(3, 5), decay = c(0.5, 1.5, 5)),
    expand.grid(size = c(5, 7), decay = c(3, 9, 27)))
    
nnetGrid3
sizedecay
3 0.5
5 0.5
3 1.5
5 1.5
3 5.0
5 5.0
5 3.0
7 3.0
5 9.0
7 9.0
5 27.0
7 27.0
In [354]:
set.seed(100)
starttime <- proc.time()[3]
nnetTune3 <- train(x = trainingData[,-126], y = trainingData[,126],
                  method = "nnet",
                  tuneGrid = nnetGrid3,
                  trControl = ctrl2,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = TRUE,
                  MaxNWts = 7 * (ncol(testingData) + 1) + 7 + 1,
                  maxit = 1000,
                  allowParallel = FALSE)
proc.time()[3] - starttime
nnetTune3

plot(nnetTune3)
# weights:  890
initial  value 11971.349852 
iter  10 value 3117.259536
iter  20 value 2223.908049
iter  30 value 1890.744877
iter  40 value 1676.390611
iter  50 value 1373.165661
iter  60 value 957.998751
iter  70 value 740.067868
iter  80 value 623.519545
iter  90 value 573.495261
iter 100 value 543.170005
iter 110 value 489.501933
iter 120 value 450.503098
iter 130 value 397.694680
iter 140 value 375.968677
iter 150 value 352.054299
iter 160 value 335.985170
iter 170 value 322.428123
iter 180 value 309.073627
iter 190 value 302.791449
iter 200 value 294.449981
iter 210 value 288.689126
iter 220 value 282.763979
iter 230 value 278.411092
iter 240 value 273.810945
iter 250 value 269.548386
iter 260 value 266.298947
iter 270 value 263.266331
iter 280 value 260.910396
iter 290 value 258.912139
iter 300 value 257.411988
iter 310 value 256.229108
iter 320 value 254.669137
iter 330 value 253.507713
iter 340 value 252.450852
iter 350 value 251.344894
iter 360 value 250.374238
iter 370 value 249.607785
iter 380 value 248.975537
iter 390 value 248.219764
iter 400 value 247.500930
iter 410 value 246.681021
iter 420 value 244.770506
iter 430 value 243.065753
iter 440 value 241.383616
iter 450 value 240.004739
iter 460 value 239.053389
iter 470 value 238.005530
iter 480 value 237.003810
iter 490 value 236.131772
iter 500 value 235.369089
iter 510 value 234.978214
iter 520 value 234.683714
iter 530 value 234.423487
iter 540 value 234.247601
iter 550 value 234.105953
iter 560 value 233.977452
iter 570 value 233.874893
iter 580 value 233.782822
iter 590 value 233.712279
iter 600 value 233.642088
iter 610 value 233.573397
iter 620 value 233.496652
iter 630 value 233.452077
iter 640 value 233.426451
iter 650 value 233.417056
iter 660 value 233.411401
iter 670 value 233.407564
iter 680 value 233.405722
iter 690 value 233.404789
iter 700 value 233.404280
iter 710 value 233.403904
iter 720 value 233.403384
iter 730 value 233.400356
iter 740 value 233.387943
iter 750 value 233.370376
iter 760 value 233.360311
iter 770 value 233.357017
iter 780 value 233.355794
iter 790 value 233.355327
iter 800 value 233.355113
iter 810 value 233.354995
iter 820 value 233.354934
final  value 233.354916 
converged
elapsed: 24.846000000005
Neural Network 

951 samples
125 predictors

Pre-processing: centered (125), scaled (125) 
Resampling: Cross-Validated (2 fold) 
Summary of sample sizes: 475, 476 
Resampling results across tuning parameters:

  size  decay  RMSE       Rsquared 
  3      0.5   0.8757235  0.8245615
  3      1.5   0.7818142  0.8570045
  3      5.0   0.7392560  0.8707778
  5      0.5   0.8027845  0.8519641
  5      1.5   0.7152059  0.8800886
  5      3.0   0.7066997  0.8820685
  5      5.0   0.6895219  0.8877755
  5      9.0   0.7160121  0.8793997
  5     27.0   0.8417813  0.8463226
  7      3.0   0.6786068  0.8917519
  7      9.0   0.6994337  0.8847423
  7     27.0   0.8206408  0.8510602

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were size = 7 and decay = 3.

We're getting close, let's concentrate on 5 and 7 nodes and the decay range 1 to 8

In [355]:
nnetGrid4 <- expand.grid(size = c(5, 7), decay = c(1, 2, 4, 8))
    
t(nnetGrid4)
size57575757
decay11224488
In [356]:
set.seed(100)
starttime <- proc.time()[3]
nnetTune4 <- train(x = trainingData[,-126], y = trainingData[,126],
                  method = "nnet",
                  tuneGrid = nnetGrid4,
                  trControl = ctrl2,
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = TRUE,
                  MaxNWts = 11 * (ncol(testingData) + 1) + 11 + 1,
                  maxit = 1500,
                  allowParallel = FALSE)
proc.time()[3] - starttime
nnetTune4

plot(nnetTune4)
# weights:  890
initial  value 13687.184090 
iter  10 value 4340.000822
iter  20 value 3462.722302
iter  30 value 2889.162930
iter  40 value 2312.655226
iter  50 value 1517.479913
iter  60 value 909.764856
iter  70 value 689.635802
iter  80 value 601.695253
iter  90 value 555.565791
iter 100 value 497.625328
iter 110 value 446.854046
iter 120 value 423.436391
iter 130 value 390.163740
iter 140 value 369.536799
iter 150 value 353.680325
iter 160 value 344.383264
iter 170 value 334.479794
iter 180 value 326.999471
iter 190 value 320.011663
iter 200 value 313.901725
iter 210 value 307.921205
iter 220 value 302.427256
iter 230 value 298.673667
iter 240 value 295.380709
iter 250 value 292.787357
iter 260 value 290.842884
iter 270 value 288.353089
iter 280 value 286.664995
iter 290 value 284.662128
iter 300 value 282.937521
iter 310 value 281.036134
iter 320 value 279.540674
iter 330 value 278.219216
iter 340 value 277.301060
iter 350 value 276.593597
iter 360 value 275.912554
iter 370 value 275.296962
iter 380 value 274.732375
iter 390 value 274.268198
iter 400 value 273.779559
iter 410 value 273.375013
iter 420 value 272.990008
iter 430 value 272.711352
iter 440 value 272.416575
iter 450 value 272.201388
iter 460 value 271.964092
iter 470 value 271.649409
iter 480 value 271.296924
iter 490 value 270.963520
iter 500 value 270.577132
iter 510 value 270.196477
iter 520 value 269.874235
iter 530 value 269.606438
iter 540 value 269.268580
iter 550 value 269.054566
iter 560 value 268.903192
iter 570 value 268.802754
iter 580 value 268.722981
iter 590 value 268.648915
iter 600 value 268.614954
iter 610 value 268.577757
iter 620 value 268.557742
iter 630 value 268.540196
iter 640 value 268.524315
iter 650 value 268.489246
iter 660 value 268.406256
iter 670 value 268.315831
iter 680 value 268.221581
iter 690 value 268.120474
iter 700 value 268.057446
iter 710 value 268.013457
iter 720 value 267.983041
iter 730 value 267.965077
iter 740 value 267.956383
iter 750 value 267.952080
iter 760 value 267.949377
iter 770 value 267.948241
iter 780 value 267.947655
iter 790 value 267.947304
iter 800 value 267.946839
iter 810 value 267.946352
iter 820 value 267.945217
iter 830 value 267.943832
iter 840 value 267.943199
iter 850 value 267.942763
iter 860 value 267.942533
iter 870 value 267.942426
iter 880 value 267.942360
iter 880 value 267.942357
iter 880 value 267.942357
final  value 267.942357 
converged
elapsed: 23.1880000000019
Neural Network 

951 samples
125 predictors

Pre-processing: centered (125), scaled (125) 
Resampling: Cross-Validated (2 fold) 
Summary of sample sizes: 475, 476 
Resampling results across tuning parameters:

  size  decay  RMSE       Rsquared 
  5     1      0.7864647  0.8566869
  5     2      0.7329654  0.8749357
  5     4      0.6960195  0.8861047
  5     8      0.7047492  0.8831901
  7     1      0.7509099  0.8701415
  7     2      0.7057734  0.8834655
  7     4      0.6940575  0.8869245
  7     8      0.6946224  0.8863640

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were size = 7 and decay = 4.
In [357]:
nnetGrid5 <- expand.grid(size = 7, decay = c(5, 8))
    
t(nnetGrid5)
size77
decay57
In [358]:
set.seed(100)
starttime <- proc.time()[3]
nnetTune5 <- train(x = trainingData[,-126], y = trainingData[,126],
                  method = "nnet",
                  tuneGrid = nnetGrid5,
                  trControl = ctrl,  # back to 4-fold CV
                  preProc = c("center", "scale"),
                  linout = TRUE,
                  trace = TRUE,
                  MaxNWts = 13 * (ncol(testingData) + 1) + 13 + 1,
                  maxit = 2000,
                  allowParallel = FALSE)
proc.time()[3] - starttime
nnetTune5

plot(nnetTune5)
# weights:  890
initial  value 23064.197676 
iter  10 value 5768.817089
iter  20 value 4644.092554
iter  30 value 4004.849116
iter  40 value 2227.248502
iter  50 value 1060.475579
iter  60 value 882.858899
iter  70 value 790.013336
iter  80 value 660.153395
iter  90 value 616.424436
iter 100 value 579.504926
iter 110 value 536.389689
iter 120 value 517.130541
iter 130 value 475.310541
iter 140 value 465.677924
iter 150 value 453.998520
iter 160 value 442.346408
iter 170 value 431.856854
iter 180 value 421.887182
iter 190 value 412.367482
iter 200 value 404.905200
iter 210 value 399.017058
iter 220 value 395.948928
iter 230 value 393.963787
iter 240 value 392.097611
iter 250 value 390.604581
iter 260 value 389.482032
iter 270 value 388.643233
iter 280 value 387.356351
iter 290 value 386.254214
iter 300 value 385.310875
iter 310 value 384.661167
iter 320 value 384.111859
iter 330 value 383.742560
iter 340 value 383.455138
iter 350 value 383.262340
iter 360 value 383.081664
iter 370 value 382.964566
iter 380 value 382.848389
iter 390 value 382.780381
iter 400 value 382.730000
iter 410 value 382.694168
iter 420 value 382.673764
iter 430 value 382.659948
iter 440 value 382.649963
iter 450 value 382.643274
iter 460 value 382.638744
iter 470 value 382.636293
iter 480 value 382.634625
iter 490 value 382.633019
iter 500 value 382.631792
iter 510 value 382.630829
iter 520 value 382.629964
iter 530 value 382.629402
iter 540 value 382.629083
iter 550 value 382.628930
iter 550 value 382.628926
final  value 382.628900 
converged
elapsed: 14.7289999999994
Neural Network 

951 samples
125 predictors

Pre-processing: centered (125), scaled (125) 
Resampling: Cross-Validated (4 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  decay  RMSE       Rsquared 
  5      0.6761401  0.8915487
  7      0.6717117  0.8935463

Tuning parameter 'size' was held constant at a value of 7
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were size = 7 and decay = 7.
In [361]:
testResults$NNet = predict(nnetTune5, testingData)
cor(testResults[,c(1,8:11)])
obsRidgeEnetEnet2NNet
obs1.00000000.93880130.94136240.94079790.9407308
Ridge0.93880131.00000000.99630900.99735600.9866612
Enet0.94136240.99630901.00000000.99962740.9875520
Enet20.94079790.99735600.99962741.00000000.9875379
NNet0.94073080.98666120.98755200.98753791.0000000

To use another implementation avNNet, the correlated predictors should be removed.

In [362]:
# (C) Kuhn and Johnson, 2013
tooHigh <- findCorrelation(cor(solTrainXtrans), cutoff=.75)
trainXnnet <- solTrainXtrans[, -tooHigh]
testXnnet <- solTestXtrans[, -tooHigh]
In [363]:
startTime <- proc.time()[3]
# nnetGrid <- expand.grid(decay = c(0, 0.01, .1), 
#                         size = c(1, 3, 5, 7, 9, 11, 13), 
#                         bag = TRUE)
broadGrid <- expand.grid(decay = c(1, 10), 
                        size = 7, 
                        bag = TRUE)
broadGrid

# set.seed(100)
# when I mistakenly ran on solTrainXtrans, only [1.5, 5] completed without NA
#     so I assume using trainXnnet for avNNet is required
avNNetTune <- train(trainXnnet, solTrainY, method="avNNet",
                    tuneGrid = broadGrid, preProc = c("center", "scale"),
                    linout = TRUE, trace = TRUE, 
                    MaxNWts = 9*(ncol(trainXnnet) + 1) + 9 + 1,
                    maxit = 50)

cat(sprintf('Total time %.1f', proc.time()[3]-startTime))
avNNetTune
decaysizebag
1 7 TRUE
10 7 TRUE
Total time 45.2
Model Averaged Neural Network 

951 samples
141 predictors

Pre-processing: centered (141), scaled (141) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 951, 951, 951, 951, 951, 951, ... 
Resampling results across tuning parameters:

  decay  RMSE       Rsquared 
   1     0.9219795  0.8003117
  10     0.8588451  0.8298753

Tuning parameter 'size' was held constant at a value of 7
Tuning
 parameter 'bag' was held constant at a value of TRUE
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were size = 7, decay = 10 and bag = TRUE.

( ! ) We can see that avNNet takes a lot of time. I'll defer to the experts on implementing this method...

In [127]:
# Bagging doesn't take more time but seems to give poorer results...

startTime <- proc.time()[3]
# nnetGrid <- expand.grid(decay = c(0, 0.01, .1), 
#                         size = c(1, 3, 5, 7, 9, 11, 13), 
#                         bag = TRUE)
broadGrid <- expand.grid(decay = c(0.1, 1, 10), 
                        size = c(7, 3, 11), 
                        bag = FALSE)
broadGrid = broadGrid[c(1:3,5,8),]
broadGrid

# set.seed(100)
# when I mistakenly ran on solTrainXtrans, only [1.5, 5] completed without NA
#     so I assume using trainXnnet for avNNet is required (remove correlated var.)
avNNetTune2 <- train(trainXnnet, solTrainY, method="avNNet",
                    tuneGrid = broadGrid, preProc = c("center", "scale"),
                    linout = TRUE, trace = TRUE, 
                    MaxNWts = 10*(ncol(trainXnnet) + 1) + 10 + 1,
                    maxit = 100)

cat(sprintf('Total time %.1f', proc.time()[3]-startTime))
avNNetTune2
decaysizebag
1 0.1 7 FALSE
2 1.0 7 FALSE
310.0 7 FALSE
5 1.0 3 FALSE
8 1.0 11 FALSE
Warning message in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
“There were missing values in resampled performance measures.”Warning message in train.default(trainXnnet, solTrainY, method = "avNNet", tuneGrid = broadGrid, :
“missing values found in aggregated results”
Total time 144.6
Model Averaged Neural Network 

951 samples
141 predictors

Pre-processing: centered (141), scaled (141) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 951, 951, 951, 951, 951, 951, ... 
Resampling results across tuning parameters:

  decay  size  RMSE       Rsquared 
   0.1    7    0.8863794  0.8173240
   1.0    3    0.8700260  0.8207401
   1.0    7    0.8244269  0.8401613
   1.0   11          NaN        NaN
  10.0    7    0.7979998  0.8490339

Tuning parameter 'bag' was held constant at a value of FALSE
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were size = 7, decay = 10 and bag = FALSE.
In [128]:
# checking to see changes in results with 1000 iterations for 3 cases in avNNetTune
startTime <- proc.time()[3]
# nnetGrid <- expand.grid(decay = c(0, 0.01, .1), 
#                         size = c(1, 3, 5, 7, 9, 11, 13), 
#                         bag = TRUE)
broadGrid <- expand.grid(decay = c(0.1, 1, 10), 
                        size = c(7, 3, 11), 
                        bag = TRUE)
broadGrid = broadGrid[c(1,3,5),]
broadGrid

# set.seed(100)
# when I mistakenly ran on solTrainXtrans, only [1.5, 5] completed without NA
#     so I assume using trainXnnet for avNNet is required
avNNetTune3 <- train(trainXnnet, solTrainY, method="avNNet",
                    tuneGrid = broadGrid, preProc = c("center", "scale"),
                    linout = TRUE, trace = TRUE, 
                    MaxNWts = 10*(ncol(trainXnnet) + 1) + 10 + 1,
                    maxit = 1000)

cat(sprintf('Total time %.1f', proc.time()[3]-startTime))
avNNetTune3
decaysizebag
1 0.17 TRUE
310.07 TRUE
5 1.03 TRUE
Total time 739.6
Model Averaged Neural Network 

951 samples
141 predictors

Pre-processing: centered (141), scaled (141) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 951, 951, 951, 951, 951, 951, ... 
Resampling results across tuning parameters:

  decay  size  RMSE       Rsquared 
   0.1   7     0.8072921  0.8479810
   1.0   3     0.8231635  0.8413128
  10.0   7     0.8095019  0.8437665

Tuning parameter 'bag' was held constant at a value of TRUE
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were size = 7, decay = 0.1 and bag = TRUE.

The maxit = 100 results were:

  decay  size  RMSE       Rsquared 
   0.1    7    0.9321281  0.8002468
   1.0    3    0.8950329  0.8110429
  10.0    7    0.8151765  0.8436004
In [137]:
# may not be the most optimal but should give good results
broadGrid <- expand.grid(decay = c(0.5), 
                        size = c(6, 8), 
                        bag = TRUE)
set.seed(100)
startTime <- proc.time()[3]
avNNetTuneF <- train(trainXnnet, solTrainY, method="avNNet",
                    tuneGrid = broadGrid, preProc = c("center", "scale"),
                    linout = TRUE, trace = TRUE, 
                    MaxNWts = 10*(ncol(trainXnnet) + 1) + 10 + 1,
                    maxit = 1500)
cat(sprintf('Total time %.1f', proc.time()[3]-startTime))
avNNetTuneF
Total time 1004.7
Model Averaged Neural Network 

951 samples
141 predictors

Pre-processing: centered (141), scaled (141) 
Resampling: Bootstrapped (25 reps) 
Summary of sample sizes: 951, 951, 951, 951, 951, 951, ... 
Resampling results across tuning parameters:

  size  RMSE       Rsquared 
  6     0.7967607  0.8534171
  8     0.7795084  0.8585573

Tuning parameter 'decay' was held constant at a value of 0.5
Tuning
 parameter 'bag' was held constant at a value of TRUE
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were size = 8, decay = 0.5 and bag = TRUE.
In [139]:
#testResults$avNNet = predict(avNNetTuneF, testXnnet)
plotsize(7,4)
plot(avNNetTuneF)

Multivariate Adaptive Regression Splines

Splines are a flexible model to fit a nonlinear regression function ($E(Y|X)$). MARS uses a collection of hockey-stick functions (hinge functions) to fit the response surface: $$h(x) = x\, \mathbf{I}[x > 0]$$ The nodes for the individual hinge functions are actually $h(\mathbf{x} - \mathbf{c})$ (the points $\mathbf{c}$ where the hinge functions start increasing), must be estimated along with the coefficients for how the $h(x)$ fit the regression function.

In [368]:
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 7.2 Multivariate Adaptive Regression Splines

set.seed(100)
# first a wide sparse tuneGrid
marsTune0 <- train(x = solTrainXtrans, y = solTrainY,
                  method = "earth",
                  tuneGrid = expand.grid(degree = c(1,2,3), nprune = c(3,9,27,81)),
                  trControl = ctrl)

marsTune0
plot(marsTune0)
Multivariate Adaptive Regression Spline 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (4 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  degree  nprune  RMSE       Rsquared 
  1        3      1.2032828  0.6508545
  1        9      0.8711956  0.8205688
  1       27      0.7200438  0.8791417
  1       81      0.7032375  0.8846252
  2        3      1.1146563  0.7049950
  2        9      0.8278346  0.8385735
  2       27      0.7332688  0.8743454
  2       81      0.7381761  0.8713007
  3        3      1.1138595  0.7056060
  3        9      0.8655595  0.8225751
  3       27      0.7478830  0.8727233
  3       81      0.7306112  0.8787719

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were nprune = 81 and degree = 1.
In [372]:
set.seed(100)
# expanding tuneGrid
marsTune1 <- train(x = solTrainXtrans, y = solTrainY,
                  method = "earth",
                  tuneGrid = expand.grid(degree = c(1,2,3), nprune = c(60,120)),
                  #tuneGrid = expand.grid(degree = c(1,2,3), nprune = round(30*c(1,1.25,1.25^2,1.25^3))),
                  trControl = ctrl)
marsTune1

plot(marsTune1)
Multivariate Adaptive Regression Spline 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (4 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  degree  nprune  RMSE       Rsquared 
  1        60     0.7032375  0.8846252
  1       120     0.7032375  0.8846252
  2        60     0.7381761  0.8713007
  2       120     0.7381761  0.8713007
  3        60     0.7306112  0.8787719
  3       120     0.7306112  0.8787719

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were nprune = 60 and degree = 1.
In [379]:
set.seed(100)
# finer tuneGrid
tuneval <- expand.grid(nprune = 10*(3:9), degree = c(2,1,3))
tuneval = tuneval[c(1:8,10,12,16,19),]
t(tuneval)
1234567810121619
nprune304050607080903050704070
degree 2 2 2 2 2 2 2 1 1 1 3 3
In [380]:
marsTune <- train(x = solTrainXtrans, y = solTrainY,
                  method = "earth",
                  tuneGrid = tuneval,
                  trControl = ctrl20)
marsTune
# takes two minutes (but worth it)
Multivariate Adaptive Regression Spline 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 903, 904, 903, 903, 904, 904, ... 
Resampling results across tuning parameters:

  degree  nprune  RMSE       Rsquared 
  1       30      0.7060970  0.8813708
  1       50      0.6919723  0.8859480
  1       70      0.6919723  0.8859480
  2       30      0.7141661  0.8791528
  2       40      0.7115677  0.8801405
  2       50      0.7115677  0.8801405
  2       60      0.7115677  0.8801405
  2       70      0.7115677  0.8801405
  2       80      0.7115677  0.8801405
  2       90      0.7115677  0.8801405
  3       40      0.6741254  0.8929622
  3       70      0.6743365  0.8927875

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were nprune = 40 and degree = 3.
In [381]:
plot(marsTune)
In [382]:
summary(marsTune)
Call: earth(x=data.frame[951,228], y=c(-3.97,-3.98,-...), keepxy=TRUE,
            degree=3, nprune=40)

                                                                        coefficients
(Intercept)                                                               -6.9262283
FP059                                                                     -0.5332720
FP137                                                                      1.2140197
FP170                                                                      0.4306846
h(5.94458-MolWeight)                                                       5.7923163
h(2.99573-NumNonHAtoms)                                                    4.9812995
h(NumNonHAtoms-2.99573)                                                   -4.1380112
h(1.9554-SurfaceArea1)                                                    -0.1840880
h(SurfaceArea1-1.9554)                                                     0.2447441
FP043 * FP137                                                              0.7052706
FP044 * h(1.9554-SurfaceArea1)                                            -0.3648178
FP081 * h(2.99573-NumNonHAtoms)                                           -0.3827328
FP109 * h(NumNonHAtoms-2.99573)                                            3.2270772
FP137 * h(SurfaceArea2-4.66178)                                           -0.1306682
FP026 * h(MolWeight-5.94458)                                             -26.4803588
h(5.94458-MolWeight) * h(NumNonHAtoms-3.09104)                            27.8769834
h(5.94458-MolWeight) * h(3.09104-NumNonHAtoms)                            -3.8607793
h(5.94458-MolWeight) * h(NumRotBonds-1.94591)                             -4.1394677
h(5.94458-MolWeight) * h(NumAromaticBonds-2.56495)                        -3.7064164
h(5.94458-MolWeight) * h(2.56495-NumAromaticBonds)                        -0.1252909
h(5.94458-MolWeight) * h(NumCarbon-2.06886)                               -1.2078304
h(5.94458-MolWeight) * h(2.06886-NumCarbon)                                0.9478414
h(NumNonHAtoms-2.99573) * h(NumHydrogen-5.62625)                          -1.9196373
h(NumNonHAtoms-2.99573) * h(5.62625-NumHydrogen)                          -1.8639214
h(2.99573-NumNonHAtoms) * h(SurfaceArea2-4.481)                           -0.1647506
h(2.99573-NumNonHAtoms) * h(4.481-SurfaceArea2)                           -0.1832237
h(3.31754-NumHydrogen) * h(1.9554-SurfaceArea1)                           -0.1672895
h(NumHydrogen-3.31754) * h(1.9554-SurfaceArea1)                           -0.3218540
h(4.17781-NumHydrogen) * h(SurfaceArea1-1.9554)                           -0.0899016
h(NumHydrogen-4.17781) * h(SurfaceArea1-1.9554)                            0.0392073
h(1.09861-NumOxygen) * h(SurfaceArea1-1.9554)                             -0.0474872
h(NumOxygen-1.09861) * h(SurfaceArea1-1.9554)                              0.1207537
FP040 * h(1.09861-NumOxygen) * h(SurfaceArea1-1.9554)                      0.1180802
FP099 * h(5.94458-MolWeight) * h(NumAromaticBonds-2.56495)                27.4455248
FP107 * h(1.09861-NumOxygen) * h(SurfaceArea1-1.9554)                     -0.1282781
FP161 * h(5.94458-MolWeight) * h(NumCarbon-2.06886)                        0.7056914
FP176 * h(5.94458-MolWeight) * h(NumCarbon-2.06886)                        0.4111568
h(5.94458-MolWeight) * h(1.94591-NumRotBonds) * h(SurfaceArea2-8.70938)   -0.1122810
h(5.02506-MolWeight) * h(4.17781-NumHydrogen) * h(SurfaceArea1-1.9554)     0.0764015
h(MolWeight-5.02506) * h(4.17781-NumHydrogen) * h(SurfaceArea1-1.9554)     0.0685996

Selected 40 of 44 terms, and 22 of 228 predictors
Termination condition: RSq changed by less than 0.001 at 44 terms
Importance: MolWeight, NumNonHAtoms, SurfaceArea1, FP137, FP043, ...
Number of terms at each degree of interaction: 1 8 23 8
GCV 0.3426774    RSS 261.8748    GRSq 0.9182768    RSq 0.9341908
In [384]:
testResults$MARS <- predict(marsTune, solTestXtrans)
cor(testResults[,c(1,8:12)])
obsRidgeEnetEnet2NNetMARS
obs1.00000000.93880130.94136240.94079790.94073080.9446675
Ridge0.93880131.00000000.99630900.99735600.98666120.9673092
Enet0.94136240.99630901.00000000.99962740.98755200.9701049
Enet20.94079790.99735600.99962741.00000000.98753790.9705076
NNet0.94073080.98666120.98755200.98753791.00000000.9712741
MARS0.94466750.96730920.97010490.97050760.97127411.0000000

The improvement from degree=3 seems to come out of nowhere. This is one of the problems with using a small number of folds, the data are more limited for each fold's predictions. The additional data are helpful in finding degree=3 effects which are obviously real due to the great test set results. To compare with 4-fold results:

In [388]:
marsTune4Fold <- train(x = solTrainXtrans, y = solTrainY,
                  method = "earth",
                  tuneGrid = tuneval,
                  trControl = ctrl)
marsTune4Fold
# takes one minute
plot(marsTune4Fold)
Multivariate Adaptive Regression Spline 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (4 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  degree  nprune  RMSE       Rsquared 
  1       30      0.7091055  0.8825701
  1       50      0.7032375  0.8846252
  1       70      0.7032375  0.8846252
  2       30      0.7289917  0.8744521
  2       40      0.7381761  0.8713007
  2       50      0.7381761  0.8713007
  2       60      0.7381761  0.8713007
  2       70      0.7381761  0.8713007
  2       80      0.7381761  0.8713007
  2       90      0.7381761  0.8713007
  3       40      0.7349071  0.8775626
  3       70      0.7306112  0.8787719

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were nprune = 50 and degree = 1.

An added bonus, predictor importance measures are available from the MARS fits:

In [385]:
marsImp <- varImp(marsTune, scale = FALSE)
plotsize(7,5)
plot(marsImp, top = 25)

Support Vector Machines

Just like neural networks, maybe more so, it is very important to tune SVMs and their many model parameters. Below a large sweep of cost values is shown over several orders of magnitude -- a good idea! SVM model parameters cover a large region unlike some methods.

In [386]:
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 7.3 Support Vector Machines

## In a recent update to caret, the method to estimate the
## sigma parameter was slightly changed. These results will
## slightly differ from the text for that reason.

set.seed(100)
svmRTune <- train(x = solTrainXtrans, y = solTrainY,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneLength = 14,
                  trControl = ctrl)
svmRTune
Support Vector Machines with Radial Basis Function Kernel 

951 samples
228 predictors

Pre-processing: centered (228), scaled (228) 
Resampling: Cross-Validated (4 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  C        RMSE       Rsquared 
     0.25  0.8516144  0.8527500
     0.50  0.7466833  0.8783793
     1.00  0.6929652  0.8908344
     2.00  0.6682661  0.8963082
     4.00  0.6606083  0.8978305
     8.00  0.6500999  0.9009305
    16.00  0.6457500  0.9020346
    32.00  0.6405546  0.9034572
    64.00  0.6374404  0.9043125
   128.00  0.6346527  0.9049878
   256.00  0.6356789  0.9046221
   512.00  0.6385039  0.9038120
  1024.00  0.6414556  0.9029918
  2048.00  0.6449199  0.9020963

Tuning parameter 'sigma' was held constant at a value of 0.002762998
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were sigma = 0.002762998 and C = 128.
In [155]:
svmRTune$finalModel
Support Vector Machine object of class "ksvm" 

SV type: eps-svr  (regression) 
 parameter : epsilon = 0.1  cost C = 256 

Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  0.00276299791797444 

Number of Support Vectors : 638 

Objective Function Value : -1122.876 
Training error : 0.009275 
In [156]:
plot(svmRTune, scales = list(x = list(log = 2)))
In [169]:
# but can sigma be tuned in the radial basis kernel?
Rtuneval <- expand.grid(sigma = c(.0027, .0009, .0081), C = 5^(-1:4))

set.seed(100)
svmRTune1 <- train(x = solTrainXtrans, y = solTrainY,
                  method = "svmRadial",
                  preProc = c("center", "scale"),
                  tuneGrid = Rtuneval,
                  trControl = ctrl)
svmRTune1
Support Vector Machines with Radial Basis Function Kernel 

951 samples
228 predictors

Pre-processing: centered (228), scaled (228) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 
Resampling results across tuning parameters:

  sigma   C      RMSE       Rsquared 
  0.0009    0.2  1.0053957  0.8108242
  0.0009    1.0  0.7195838  0.8820746
  0.0009    5.0  0.6263986  0.9062942
  0.0009   25.0  0.6133773  0.9099953
  0.0009  125.0  0.6183704  0.9087862
  0.0009  625.0  0.6338237  0.9047233
  0.0027    0.2  0.8493628  0.8536611
  0.0027    1.0  0.6606906  0.8983515
  0.0027    5.0  0.6128332  0.9103894
  0.0027   25.0  0.5998045  0.9138249
  0.0027  125.0  0.5980350  0.9142175
  0.0027  625.0  0.5988361  0.9141228
  0.0081    0.2  0.9854832  0.8116332
  0.0081    1.0  0.7711909  0.8653243
  0.0081    5.0  0.7283495  0.8779307
  0.0081   25.0  0.7176235  0.8806729
  0.0081  125.0  0.7136537  0.8821503
  0.0081  625.0  0.7160022  0.8811837

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were sigma = 0.0027 and C = 125.
In [170]:
plot(svmRTune1, scales = list(x = list(log = 2)))

Choice of sigma seems adequate, maybe I'd rerun with sigma = .002.

...But it's always good to know where the overfitting scale is for each parameter, and reducing sigma exposes us to overfitting the particulars in our training data, just like choosing $k=1$ in nearest neighbors is exposed to overfitting. That said, sigma = .0027 might be just right.

In [158]:
svmGrid <- expand.grid(degree = 1:2, 
                       scale = c(0.025, 0.005, 0.001), 
                       C = 10^(-1:2))
set.seed(100)
svmPTune <- train(x = solTrainXtrans, y = solTrainY,
                  method = "svmPoly",
                  preProc = c("center", "scale"),
                  tuneGrid = svmGrid,
                  trControl = ctrl)

svmPTune
Support Vector Machines with Polynomial Kernel 

951 samples
228 predictors

Pre-processing: centered (228), scaled (228) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 
Resampling results across tuning parameters:

  degree  scale  C      RMSE       Rsquared 
  1       0.001    0.1  1.3540284  0.7243821
  1       0.001    1.0  0.7925187  0.8582787
  1       0.001   10.0  0.6873021  0.8869297
  1       0.001  100.0  0.7283714  0.8737044
  1       0.005    0.1  0.8933684  0.8341428
  1       0.005    1.0  0.6935162  0.8853703
  1       0.005   10.0  0.7131621  0.8784904
  1       0.005  100.0  0.7379704  0.8704753
  1       0.025    0.1  0.7185737  0.8784253
  1       0.025    1.0  0.6984103  0.8832000
  1       0.025   10.0  0.7366633  0.8707496
  1       0.025  100.0  0.7422996  0.8694461
  2       0.001    0.1  1.1020197  0.7830031
  2       0.001    1.0  0.7008308  0.8854851
  2       0.001   10.0  0.6279474  0.9053654
  2       0.001  100.0  0.6480095  0.9001255
  2       0.005    0.1  0.7212836  0.8830798
  2       0.005    1.0  0.6166303  0.9088087
  2       0.005   10.0  0.6176591  0.9088929
  2       0.005  100.0  0.6403590  0.9028868
  2       0.025    0.1  0.6311596  0.9041084
  2       0.025    1.0  0.6243454  0.9060953
  2       0.025   10.0  0.6336443  0.9036024
  2       0.025  100.0  0.6778603  0.8906129

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were degree = 2, scale = 0.005 and C = 1.
In [159]:
plot(svmPTune, 
     scales = list(x = list(log = 2), 
                   between = list(x = .5, y = 1)))

Selecting tuning parameters

Which values for degree, scale, and cost did the program select? Are these the ones you would select?

  • From above, we see that the scale parameter is very important. Set the cost too low for a given scale, and the RMSE rapidly increases.
  • Choosing a larger scale helps reduce our exposure to mistakenly selecting a cost that's too low. We aren't sure how performance will be on the whole data, these are estimates based on our cross-validation results, so we have to have some "wiggle room" for the CV estimates being off.
  • Choosing a scale of 0.025 may not achieve the lowest RMSE, but I like degree=2 scale=0.025 and cost=1 based on these CV results.
  • Let's choose a finer scale and see if degree=3 will compute in a reasonable time. Based on the similarity between degree 1 and 2 results, we have some confidence in our parameters choices for degree=3 to estimate its performance
In [166]:
svmGrid <- expand.grid( scale = c(0.02, 0.01), 
                       C = 4^(-1:1), degree = c(2,3))
svmGrid = svmGrid[c(1:6,9,12),]
# for degree 3: choosing C=4 with scale=0.01 (not too low) and C=1 with scale=0.02
svmGrid
scaleCdegree
10.020.252
20.010.252
30.021.002
40.011.002
50.024.002
60.014.002
90.021.003
120.014.003
In [167]:
set.seed(100)
svmPTune <- train(x = solTrainXtrans, y = solTrainY,
                  method = "svmPoly",
                  preProc = c("center", "scale"),
                  tuneGrid = svmGrid,
                  trControl = ctrl)

svmPTune
Support Vector Machines with Polynomial Kernel 

951 samples
228 predictors

Pre-processing: centered (228), scaled (228) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 
Resampling results across tuning parameters:

  scale  C     degree  RMSE       Rsquared 
  0.01   0.25  2       0.6209793  0.9075864
  0.01   1.00  2       0.6111335  0.9101010
  0.01   4.00  2       0.6121998  0.9100580
  0.01   4.00  3       0.6096875  0.9104752
  0.02   0.25  2       0.6211801  0.9068750
  0.02   1.00  2       0.6179583  0.9079720
  0.02   1.00  3       0.6436786  0.9011521
  0.02   4.00  2       0.6248520  0.9061891

RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were degree = 3, scale = 0.01 and C = 4.
In [168]:
plot(svmPTune, 
     scales = list(x = list(log = 2), 
                   between = list(x = .5, y = 1)))

Based on these CV results, I choose scale=0.01 (better than 0.005), Cost=2, degree=2 (not worth jumping the complexity to degree 3 for almost no benefit)

In [178]:
svmPreProc <- preProcess(solTrainXtrans, c("center","scale"))
solTrainXsvm <- predict(svmPreProc, solTrainXtrans)
solTestXsvm <- predict(svmPreProc, solTestXtrans)
In [179]:
set.seed(100)
svmPTuneF <- svm(x = solTrainXsvm, y = solTrainY,
                   C=2, degree=2, scale=0.01, kernel="polynomial",
                  preProc = c("center", "scale"))
svmPTuneF
Warning message in any(scale):
“coercing argument of type 'double' to logical”Warning message in any(object$scaled):
“coercing argument of type 'double' to logical”Warning message in any(object$scaled):
“coercing argument of type 'double' to logical”
Call:
svm.default(x = solTrainXsvm, y = solTrainY, scale = 0.01, kernel = "polynomial", 
    degree = 2, C = 2, preProc = c("center", "scale"))


Parameters:
   SVM-Type:  eps-regression 
 SVM-Kernel:  polynomial 
       cost:  1 
     degree:  2 
      gamma:  0.004385965 
     coef.0:  0 
    epsilon:  0.1 


Number of Support Vectors:  722
In [200]:
testResults$SVMr <- predict(svmRTune, solTestXtrans)
testResults$SVMp <- predict(svmPTuneF, solTestXsvm)
Warning message in any(object$scaled):
“coercing argument of type 'double' to logical”Warning message in any(object$scaled):
“coercing argument of type 'double' to logical”

K-Nearest Neighbors

In [192]:
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 7.4 K-Nearest Neighbors

### First we remove near-zero variance predictors
knnDescr <- solTrainXtrans[, -nearZeroVar(solTrainXtrans)]

set.seed(100)
knnTune <- train(x = knnDescr, y = solTrainY,
                 method = "knn",
                 preProc = c("center", "scale"),
                 tuneGrid = data.frame(k = 1:20),
                 trControl = ctrl)
knnTune
testResults$Knn <- predict(knnTune, solTestXtrans[, names(knnDescr)])
k-Nearest Neighbors 

951 samples
225 predictors

Pre-processing: centered (225), scaled (225) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared 
   1  1.229745  0.6710823
   2  1.102057  0.7233931
   3  1.057972  0.7407611
   4  1.041420  0.7465429
   5  1.056222  0.7377971
   6  1.051150  0.7391613
   7  1.052694  0.7388048
   8  1.045468  0.7420267
   9  1.054200  0.7375261
  10  1.060630  0.7348148
  11  1.064501  0.7332010
  12  1.076157  0.7270332
  13  1.084877  0.7226715
  14  1.089678  0.7201092
  15  1.098308  0.7153520
  16  1.107097  0.7108917
  17  1.114296  0.7075958
  18  1.122538  0.7037828
  19  1.129979  0.7002080
  20  1.136227  0.6970077

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was k = 4.
In [185]:
plot(knnTune)
In [193]:
# I'd use k=8 based on those CV results
set.seed(100)
indx50 <- createFolds(solTrainY, returnTrain = TRUE, k=50)
ctrl50 <- trainControl(method = "cv", index = indx50)

set.seed(100)
knnTuneF <- train(x = knnDescr, y = solTrainY,
                 method = "knn",
                 preProc = c("center", "scale"),
                 tuneGrid = data.frame(k = 6:10),
                 trControl = ctrl50)
knnTuneF
k-Nearest Neighbors 

951 samples
225 predictors

Pre-processing: centered (225), scaled (225) 
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 932, 934, 931, 935, 931, 932, ... 
Resampling results across tuning parameters:

  k   RMSE      Rsquared 
   6  1.037070  0.7356350
   7  1.043405  0.7338611
   8  1.025509  0.7411777
   9  1.038036  0.7377512
  10  1.045266  0.7335877

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was k = 8.
In [191]:
dim(knnDescr)
head(knnDescr)
nearZeroVar(solTrainXtrans)
dim(solTestXtrans)
  1. 951
  2. 225
FP001FP002FP003FP004FP005FP006FP007FP008FP009FP010NumCarbonNumNitrogenNumOxygenNumSulferNumChlorineNumHalogenNumRingsHydrophilicFactorSurfaceArea1SurfaceArea2
6610 1 0 0 1 0 0 1 0 0 4.177811 0.5848146 0.0000000 0.000 0.0000000 0.0000000 1.3862944 -1.606541816.812456 6.812456
6620 1 0 1 1 1 1 1 0 0 5.092358 0.6423550 0.6931472 0.375 0.0000000 0.0000000 1.6094379 -0.441330439.753834 12.029604
6631 1 1 1 1 0 0 1 0 1 4.023944 0.0000000 1.0986123 0.000 0.0000000 0.0000000 0.6931472 -0.384859108.245324 8.245324
6650 0 1 0 0 0 1 0 0 0 3.510455 0.0000000 0.0000000 0.000 0.0000000 0.0000000 0.6931472 -2.373472200.000000 0.000000
6680 0 1 1 1 1 0 0 1 0 3.317541 0.6943345 0.0000000 0.000 0.3750000 0.3750000 0.6931472 -0.070987269.913535 9.913535
6691 0 1 1 0 0 0 0 1 0 3.510455 0.4568260 0.6931472 0.375 0.4444444 0.4444444 0.0000000 -0.949253275.999109 9.123359
  1. 154
  2. 199
  3. 200
  1. 316
  2. 228
In [195]:
testResults$Knn <- predict(knnTuneF, solTestXtrans[, names(knnDescr)])

Summary of nonlinear regression methods

In [202]:
cat("Training Errors:\n")
fullpred <- predict(enetTune, solTrainXtrans)
cat('Enet:  ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(nnetTune, solTrainXtrans)
cat('NNet:  ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(avNNetTuneF, trainXnnet)
cat('avNNet:',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),":\n")
fullpred <- predict(marsTune, solTrainXtrans)
cat('MARS:  ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(svmRTune, solTrainXtrans)
cat('SVMr:  ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2)," : ")
fullpred <- predict(svmPTune, solTrainXtrans)
cat('SVMp:  ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),": ")
fullpred <- predict(knnTuneF, solTrainXtrans)
cat('kNN:   ',1 - sum((fullpred - solTrainY)^2) / sum((solTrainY - mean(solTrainY))^2),"\n")
cat("Testing Errors:\n") 
fullpred <- predict(enetTune, solTestXtrans)
cat('Enet:  ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2)," : ")
fullpred <- predict(nnetTune, solTestXtrans)
cat('NNet:  ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(avNNetTuneF, testXnnet)
cat('avNNet:',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),":\n")
fullpred <- predict(marsTune, solTestXtrans)
cat('MARS:  ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(svmRTune, solTestXtrans)
cat('SVMr:  ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- predict(svmPTune, solTestXtrans)
cat('SVMp:  ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),": ")
fullpred <- testResults$Knn
cat('kNN:  ',1 - sum((fullpred - solTestY)^2) / sum((solTestY - mean(solTestY))^2),"\n")
Training Errors:
Enet:   0.9227276 : NNet:   0.9754285 : avNNet: 0.9688247 :
MARS:   0.9189357 : SVMr:   0.9907156  : SVMp:   0.9907598 : kNN:    0.7963496 
Testing Errors:
Enet:   0.885803  : NNet:   0.8863654 : avNNet: 0.8607299 :
MARS:   0.8786158 : SVMr:   0.9144164 : SVMp:   0.9111115 : kNN:   0.7349325 
In [203]:
# I will ignore the additional tuning for SVMp (didn't work)
testResults$SVMp <- predict(svmPTune, solTestXtrans)
In [204]:
temp <- sapply(2:ncol(testResults), 
       function(k){
           sqrt(sum((testResults$obs - testResults[,k])^2)/nrow(testResults))})
names(temp) <- colnames(testResults)[2:ncol(testResults)]
temp
Linear_Regression
0.758733014453137
PLS
0.720568230929858
step
0.876958508898966
RLM
0.786980390423408
PCR
0.790923950961849
Ridge
0.719307895394738
Enet
0.701266867536379
NNet
0.69953801158417
avNNet
0.774434973328875
MARS
0.722997885119876
SVMr
0.607087283663978
SVMp
0.618697957196852
Knn
1.06840072606462

Part 7. Regression Tree Models

In [205]:
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Copyright 2013 Kuhn and Johnson
###
### Chapter 8: Regression Trees and Rule-Based Models 

################################################################################
### Section 8.1 Basic Regression Trees

### Fit two CART models to show the initial splitting process. rpart 
### only uses formulas, so we put the predictors and outcome into
### a common data frame first.

trainData <- solTrainXtrans
trainData$y <- solTrainY

rpStump <- rpart(y ~ ., data = trainData, 
                 control = rpart.control(maxdepth = 1))
rpSmall <- rpart(y ~ ., data = trainData, 
                 control = rpart.control(maxdepth = 2))

### Tune the model
set.seed(100)
cartTune <- train(x = solTrainXtrans, y = solTrainY,
                  method = "rpart",
                  tuneLength = 25,
                  trControl = ctrl)
# Note: to tune over maximum tree depth, use method = "rpart2" (rec. tuneLength=10)
cartTune
Warning message in nominalTrainWorkflow(x = x, y = y, wts = weights, info = trainInfo, :
“There were missing values in resampled performance measures.”
CART 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 
Resampling results across tuning parameters:

  cp           RMSE       Rsquared 
  0.002899268  0.9716246  0.7752447
  0.002992913  0.9660816  0.7779183
  0.003284296  0.9767689  0.7727573
  0.003560157  0.9864913  0.7681005
  0.003857172  0.9978151  0.7627426
  0.004007488  1.0005677  0.7614942
  0.004050192  1.0080565  0.7573802
  0.004224784  1.0154052  0.7534214
  0.004427556  1.0174706  0.7524694
  0.004737500  1.0196066  0.7514764
  0.005204160  1.0230892  0.7499148
  0.006181580  1.0207997  0.7495512
  0.006518655  1.0308117  0.7448964
  0.008288491  1.0464044  0.7369148
  0.008861298  1.0482274  0.7357991
  0.011501953  1.0951525  0.7125953
  0.011781343  1.1017983  0.7085687
  0.015535229  1.1218072  0.6972460
  0.017892740  1.1317681  0.6906429
  0.026503011  1.1747433  0.6680776
  0.047291114  1.2585476  0.6174856
  0.061802695  1.3197449  0.5802679
  0.069715098  1.4025314  0.5265519
  0.137700138  1.5373482  0.4375590
  0.373005064  1.9564323  0.2781173

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was cp = 0.002992913.
In [206]:
cartTune$finalModel
n= 951 

node), split, n, deviance, yval
      * denotes terminal node

  1) root 951 3979.3030000 -2.71857000  
    2) NumCarbon>=3.776816 317 1114.2000000 -4.48536300  
      4) SurfaceArea2< 0.9776982 74  173.0767000 -6.86783800  
        8) NumNonHAtoms>=2.802901 48   53.5231000 -7.74645800  
         16) NumHydrogen< 2.220711 15   16.4653600 -8.67600000 *
         17) NumHydrogen>=2.220711 33   18.2057900 -7.32393900 *
        9) NumNonHAtoms< 2.802901 26   14.0900300 -5.24576900 *
      5) SurfaceArea2>=0.9776982 243  393.1730000 -3.75983500  
       10) HydrophilicFactor< 0.197302 159  237.7748000 -4.07528300  
         20) MolWeight>=5.72513 60   65.5968800 -4.64950000  
           40) FP077< 0.5 17   18.8918500 -5.39176500 *
           41) FP077>=0.5 43   33.6358300 -4.35604700 *
         21) MolWeight< 5.72513 99  140.4044000 -3.72727300  
           42) FP075< 0.5 55   69.1337300 -4.25290900  
             84) NumRotBonds>=2.094827 8    8.5670000 -5.56500000 *
             85) NumRotBonds< 2.094827 47   44.4497900 -4.02957400 *
           43) FP075>=0.5 44   37.0793000 -3.07022700 *
       11) HydrophilicFactor>=0.197302 84  109.6285000 -3.16273800  
         22) NumMultBonds>=4.57949 28   34.5152700 -3.77892900 *
         23) NumMultBonds< 4.57949 56   59.1661900 -2.85464300  
           46) FP081>=0.5 24   15.5219800 -3.45916700 *
           47) FP081< 0.5 32   28.2953500 -2.40125000 *
    3) NumCarbon< 3.776816 634 1380.8030000 -1.83517400  
      6) MolWeight>=5.372821 104  173.9672000 -3.32846200  
       12) SurfaceArea1< 8.656872 43   51.7177400 -4.24674400  
         24) NumNonHBonds>=3.910809 8    1.7766880 -5.82875000 *
         25) NumNonHBonds< 3.910809 35   25.3426700 -3.88514300 *
       13) SurfaceArea1>=8.656872 61   60.4300200 -2.68114800  
         26) NumOxygen< 1.497866 47   25.8753700 -2.99914900 *
         27) NumOxygen>=1.497866 14   13.8457200 -1.61357100 *
      7) MolWeight< 5.372821 530  929.4181000 -1.54215100  
       14) SurfaceArea2< 0.9776982 118  121.2845000 -2.81500000  
         28) NumBonds>=2.970086 37   11.1113100 -3.96432400 *
         29) NumBonds< 2.970086 81   38.9726000 -2.29000000  
           58) NumNonHBonds>=2.026747 46   12.6950900 -2.68739100 *
           59) NumNonHBonds< 2.026747 35    9.4658170 -1.76771400 *
       15) SurfaceArea2>=0.9776982 412  562.2019000 -1.17759700  
         30) MolWeight>=4.813484 241  241.0830000 -1.74688800  
           60) NumRotBonds>=2.012676 14    3.0914860 -3.28714300 *
           61) NumRotBonds< 2.012676 227  202.7297000 -1.65189400  
            122) NumOxygen< 1.497866 205  153.1710000 -1.76263400 *
            123) NumOxygen>=1.497866 22   23.6190000 -0.62000000  
              246) NumMultBonds>=2.055944 9    0.7426222 -1.58444400 *
              247) NumMultBonds< 2.055944 13    8.7094310  0.04769231 *
         31) MolWeight< 4.813484 171  132.9333000 -0.37526320  
           62) NumCarbon>=2.5076 86   36.7958900 -0.89581400  
            124) FP116< 0.5 78   18.6207500 -1.04076900 *
            125) FP116>=0.5 8    0.5565500  0.51750000 *
           63) NumCarbon< 2.5076 85   49.2558300  0.15141180 *
In [207]:
### Plot the tuning results
plot(cartTune, scales = list(x = list(log = 10)))
In [208]:
# (C) Kuhn and Johnson, 2013
### Use the partykit package to make some nice plots. First, convert
### the rpart objects to party objects.

cartTree <- as.party(cartTune$finalModel)
plot(cartTree)
In [209]:
# (C) Kuhn and Johnson, 2013
### Get the variable importance. 'competes' is an argument that
### controls whether splits not used in the tree should be included
### in the importance calculations.

cartImp <- varImp(cartTune, scale = FALSE, competes = FALSE)
cartImp
rpart variable importance

  only 20 most important variables shown (out of 228)

                  Overall
NumNonHBonds       0.9070
SurfaceArea2       0.7564
NumMultBonds       0.7453
NumCarbon          0.7257
MolWeight          0.6693
NumNonHAtoms       0.6093
NumBonds           0.5871
FP116              0.4788
NumOxygen          0.4706
NumRotBonds        0.3794
SurfaceArea1       0.3554
NumHydrogen        0.3522
FP081              0.2594
FP075              0.2435
FP077              0.1992
HydrophilicFactor  0.1164
FP141              0.0000
FP119              0.0000
FP086              0.0000
FP149              0.0000
In [211]:
### Save the test set results in a data frame                 
testResults$CART = predict(cartTune, solTestXtrans)
In [214]:
### Tune the conditional inference tree
cGrid <- data.frame(mincriterion = sort(c(.95, seq(.5, .99, length = 3))))

set.seed(100)
ctreeTune <- train(x = solTrainXtrans, y = solTrainY,
                   method = "ctree",
                   tuneGrid = cGrid,
                   trControl = ctrl)
ctreeTune
Conditional Inference Tree 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 
Resampling results across tuning parameters:

  mincriterion  RMSE       Rsquared 
  0.500         0.9265138  0.7908261
  0.745         0.9259750  0.7904719
  0.950         0.9403788  0.7847694
  0.990         0.9574527  0.7767184

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was mincriterion = 0.745.
In [215]:
plot(ctreeTune)
In [216]:
ctreeTune$finalModel
	 Conditional inference tree with 38 terminal nodes

Response:  .outcome 
Inputs:  FP001, FP002, FP003, FP004, FP005, FP006, FP007, FP008, FP009, FP010, FP011, FP012, FP013, FP014, FP015, FP016, FP017, FP018, FP019, FP020, FP021, FP022, FP023, FP024, FP025, FP026, FP027, FP028, FP029, FP030, FP031, FP032, FP033, FP034, FP035, FP036, FP037, FP038, FP039, FP040, FP041, FP042, FP043, FP044, FP045, FP046, FP047, FP048, FP049, FP050, FP051, FP052, FP053, FP054, FP055, FP056, FP057, FP058, FP059, FP060, FP061, FP062, FP063, FP064, FP065, FP066, FP067, FP068, FP069, FP070, FP071, FP072, FP073, FP074, FP075, FP076, FP077, FP078, FP079, FP080, FP081, FP082, FP083, FP084, FP085, FP086, FP087, FP088, FP089, FP090, FP091, FP092, FP093, FP094, FP095, FP096, FP097, FP098, FP099, FP100, FP101, FP102, FP103, FP104, FP105, FP106, FP107, FP108, FP109, FP110, FP111, FP112, FP113, FP114, FP115, FP116, FP117, FP118, FP119, FP120, FP121, FP122, FP123, FP124, FP125, FP126, FP127, FP128, FP129, FP130, FP131, FP132, FP133, FP134, FP135, FP136, FP137, FP138, FP139, FP140, FP141, FP142, FP143, FP144, FP145, FP146, FP147, FP148, FP149, FP150, FP151, FP152, FP153, FP154, FP155, FP156, FP157, FP158, FP159, FP160, FP161, FP162, FP163, FP164, FP165, FP166, FP167, FP168, FP169, FP170, FP171, FP172, FP173, FP174, FP175, FP176, FP177, FP178, FP179, FP180, FP181, FP182, FP183, FP184, FP185, FP186, FP187, FP188, FP189, FP190, FP191, FP192, FP193, FP194, FP195, FP196, FP197, FP198, FP199, FP200, FP201, FP202, FP203, FP204, FP205, FP206, FP207, FP208, MolWeight, NumAtoms, NumNonHAtoms, NumBonds, NumNonHBonds, NumMultBonds, NumRotBonds, NumDblBonds, NumAromaticBonds, NumHydrogen, NumCarbon, NumNitrogen, NumOxygen, NumSulfer, NumChlorine, NumHalogen, NumRings, HydrophilicFactor, SurfaceArea1, SurfaceArea2 
Number of observations:  951 

1) MolWeight <= 5.232284; criterion = 1, statistic = 411.977
  2) HydrophilicFactor <= -1.512881; criterion = 1, statistic = 160.078
    3) HydrophilicFactor <= -2.1146; criterion = 1, statistic = 43.254
      4) NumBonds <= 3.091042; criterion = 1, statistic = 30.681
        5)*  weights = 15 
      4) NumBonds > 3.091042
        6)*  weights = 28 
    3) HydrophilicFactor > -2.1146
      7)*  weights = 49 
  2) HydrophilicFactor > -1.512881
    8) MolWeight <= 4.81389; criterion = 1, statistic = 144.708
      9) MolWeight <= 4.511299; criterion = 1, statistic = 50.8
        10) FP063 <= 0; criterion = 1, statistic = 25.371
          11) NumOxygen <= 0.6931472; criterion = 0.994, statistic = 17.686
            12)*  weights = 22 
          11) NumOxygen > 0.6931472
            13)*  weights = 13 
        10) FP063 > 0
          14)*  weights = 20 
      9) MolWeight > 4.511299
        15) SurfaceArea1 <= 0; criterion = 1, statistic = 42.536
          16)*  weights = 19 
        15) SurfaceArea1 > 0
          17) FP116 <= 0; criterion = 1, statistic = 30.967
            18) NumCarbon <= 2.372566; criterion = 1, statistic = 34.641
              19)*  weights = 26 
            18) NumCarbon > 2.372566
              20)*  weights = 82 
          17) FP116 > 0
            21)*  weights = 15 
    8) MolWeight > 4.81389
      22) SurfaceArea1 <= 9.960399; criterion = 1, statistic = 41.242
        23) FP102 <= 0; criterion = 0.999, statistic = 20.931
          24) MolWeight <= 5.113553; criterion = 0.999, statistic = 21.944
            25) FP074 <= 0; criterion = 0.882, statistic = 11.934
              26)*  weights = 84 
            25) FP074 > 0
              27) FP080 <= 0; criterion = 0.969, statistic = 14.505
                28)*  weights = 34 
              27) FP080 > 0
                29)*  weights = 8 
          24) MolWeight > 5.113553
            30)*  weights = 35 
        23) FP102 > 0
          31)*  weights = 9 
      22) SurfaceArea1 > 9.960399
        32) FP168 <= 0; criterion = 1, statistic = 26.394
          33)*  weights = 14 
        32) FP168 > 0
          34)*  weights = 34 
1) MolWeight > 5.232284
  35) SurfaceArea1 <= 0; criterion = 1, statistic = 172.73
    36) NumNonHAtoms <= 2.639057; criterion = 1, statistic = 77.005
      37) NumNonHAtoms <= 2.079442; criterion = 0.999, statistic = 22.23
        38)*  weights = 9 
      37) NumNonHAtoms > 2.079442
        39)*  weights = 20 
    36) NumNonHAtoms > 2.639057
      40) NumNonHAtoms <= 2.833213; criterion = 1, statistic = 40.101
        41)*  weights = 24 
      40) NumNonHAtoms > 2.833213
        42) NumNonHAtoms <= 2.995732; criterion = 0.973, statistic = 14.763
          43) FP015 <= 0; criterion = 0.798, statistic = 10.847
            44)*  weights = 19 
          43) FP015 > 0
            45)*  weights = 10 
        42) NumNonHAtoms > 2.995732
          46)*  weights = 10 
  35) SurfaceArea1 > 0
    47) NumCarbon <= 4.023944; criterion = 1, statistic = 52.853
      48) SurfaceArea1 <= 8.604367; criterion = 1, statistic = 41.425
        49) NumNonHAtoms <= 2.833213; criterion = 1, statistic = 26.847
          50) NumNonHAtoms <= 2.484907; criterion = 0.963, statistic = 14.197
            51)*  weights = 7 
          50) NumNonHAtoms > 2.484907
            52)*  weights = 32 
        49) NumNonHAtoms > 2.833213
          53)*  weights = 9 
      48) SurfaceArea1 > 8.604367
        54) NumOxygen <= 1.386294; criterion = 1, statistic = 30.821
          55) FP072 <= 0; criterion = 0.989, statistic = 16.491
            56)*  weights = 11 
          55) FP072 > 0
            57) MolWeight <= 5.506591; criterion = 0.999, statistic = 21.529
              58)*  weights = 48 
            57) MolWeight > 5.506591
              59)*  weights = 34 
        54) NumOxygen > 1.386294
          60) FP134 <= 0; criterion = 0.924, statistic = 12.789
            61)*  weights = 20 
          60) FP134 > 0
            62)*  weights = 10 
    47) NumCarbon > 4.023944
      63) SurfaceArea1 <= 8.399647; criterion = 0.997, statistic = 19.072
        64) FP142 <= 0; criterion = 0.965, statistic = 14.306
          65)*  weights = 36 
        64) FP142 > 0
          66) NumSulfer <= 0; criterion = 0.909, statistic = 12.442
            67)*  weights = 15 
          66) NumSulfer > 0
            68)*  weights = 10 
      63) SurfaceArea1 > 8.399647
        69) NumHalogen <= 0; criterion = 0.994, statistic = 17.604
          70) FP075 <= 0; criterion = 0.96, statistic = 14.054
            71)*  weights = 44 
          70) FP075 > 0
            72) NumMultBonds <= 3.520562; criterion = 0.825, statistic = 11.139
              73)*  weights = 19 
            72) NumMultBonds > 3.520562
              74)*  weights = 27 
        69) NumHalogen > 0
          75)*  weights = 30 
In [217]:
plot(ctreeTune$finalModel)
In [218]:
testResults$cTree <- predict(ctreeTune, solTestXtrans)

Model tree

Note: I cannot install RWeka on my machine, so I'll skip this section and leave the code for your exploration.

In [ ]:
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 8.2 Regression Model Trees and 8.3 Rule-Based Models

### Tune the model tree. Using method = "M5" actually tunes over the
### tree- and rule-based versions of the model. M = 10 is also passed
### in to make sure that there are larger terminal nodes for the
### regression models.

set.seed(100)
m5Tune <- train(x = solTrainXtrans, y = solTrainY,
                method = "M5",
                trControl = ctrl,
                control = Weka_control(M = 10))
m5Tune
In [ ]:
plot(m5Tune)
In [ ]:
m5Tune$finalModel
In [ ]:
testResults$M5 = predict(m5Tune, solTestXtrans))
plot(m5Tune$finalModel)
In [ ]:
### Show the rule-based model too

ruleFit <- M5Rules(y~., data = trainData, control = Weka_control(M = 10))
ruleFit

Bagging

In [ ]:
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 8.4 Bagged Trees

### Optional: parallel processing can be used via the 'do' packages,
### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
### up the computations.

### WARNING: Be aware of how much memory is needed to parallel
### process. It can very quickly overwhelm the available hardware. The
### estimate of the median memory usage (VSIZE = total memory size) 
### was 9706M for a core, but could range up to 9706M. This becomes 
### severe when parallelizing randomForest() and (especially) calls 
### to cforest(). 

### WARNING 2: The RWeka package does not work well with some forms of
### parallel processing, such as mutlicore (i.e. doMC).
registerDoMC(4)

set.seed(100)
treebagTune <- train(x = solTrainXtrans, y = solTrainY,
                     method = "treebag", nbagg = 25, # 'nbagg' is correct
                     trControl = ctrl)

treebagTune
testResults$Bag = predict(treebagTune, solTestXtrans)

Random Forests

In [246]:
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 8.5 Random Forests

mtryGrid <- data.frame(mtry = floor(c(.65,1,1.5)* ncol(solTrainXtrans)/3))
mtryGrid
mtry
49
76
114

This is no reason to use importance = TRUE during CV.

In [247]:
### Tune the model using cross-validation
set.seed(100)
rfTune <- train(x = solTrainXtrans, y = solTrainY,
                method = "rf",
                tuneGrid = mtryGrid,
                ntree = 200,
                importance = FALSE,
                trControl = ctrl)
rfTune
Random Forest 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 
Resampling results across tuning parameters:

  mtry  RMSE       Rsquared 
   49   0.6540550  0.8984074
   76   0.6554385  0.8969173
  114   0.6491238  0.8986294

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 114.
In [248]:
plot(rfTune)
In [249]:
mtryGrid <- data.frame(mtry = floor(c(1.33,1.67)* ncol(solTrainXtrans)/3))
t(mtryGrid)
mtry101126
In [250]:
### Tune the model using cross-validation
set.seed(100)
rfTune <- train(x = solTrainXtrans, y = solTrainY,
                method = "rf",
                tuneGrid = mtryGrid,
                ntree = 500,
                importance = FALSE,
                trControl = ctrl)
rfTune
Random Forest 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 
Resampling results across tuning parameters:

  mtry  RMSE       Rsquared 
  101   0.6478208  0.8992970
  126   0.6496097  0.8984445

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 101.
In [252]:
### Tune the model using cross-validation
set.seed(100)
rfTuneImp <- train(x = solTrainXtrans, y = solTrainY,
                method = "rf",
                tuneGrid = data.frame(mtry=101),
                ntree = 500,
                importance = TRUE,
                trControl = ctrl)
rfTuneImp
Random Forest 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 
Resampling results:

  RMSE       Rsquared 
  0.6505487  0.8984267

Tuning parameter 'mtry' was held constant at a value of 101
In [254]:
rfImp <- varImp(rfTuneImp, scale = FALSE)
rfImp
rf variable importance

  only 20 most important variables shown (out of 228)

                  Overall
MolWeight          28.511
SurfaceArea1       26.158
SurfaceArea2       24.383
HydrophilicFactor  23.164
NumCarbon          22.438
NumHydrogen        17.220
NumRotBonds        16.794
NumNonHBonds       13.371
NumNonHAtoms       13.301
NumAtoms           13.162
NumMultBonds       12.740
FP075              12.056
NumHalogen         12.014
NumBonds           10.942
NumOxygen          10.672
FP116              10.590
NumNitrogen         9.896
NumChlorine         9.261
FP092               9.151
FP063               8.569
In [255]:
### Tune the model using the OOB estimates
ctrlOOB <- trainControl(method = "oob")
set.seed(100)
rfTuneOOB <- train(x = solTrainXtrans, y = solTrainY,
                   method = "rf",
                   tuneGrid = mtryGrid,
                   ntree = 500,
                   importance = FALSE,
                   trControl = ctrlOOB)
rfTuneOOB
Random Forest 

951 samples
228 predictors

No pre-processing
Resampling results across tuning parameters:

  mtry  RMSE       Rsquared 
  101   0.6515342  0.8985510
  126   0.6470073  0.8999558

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 126.

Conditional inference forests

In [257]:
### Tune the conditional inference forests
set.seed(100)
condrfTune <- train(x = solTrainXtrans, y = solTrainY,
                    method = "cforest",
                    tuneGrid = data.frame(mtry=120),
                    controls = cforest_unbiased(ntree = 500),
                    trControl = ctrl)
condrfTune
Conditional Inference Random Forest 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 856, 857, 855, 856, 856, 855, ... 
Resampling results:

  RMSE       Rsquared 
  0.8196047  0.8454226

Tuning parameter 'mtry' was held constant at a value of 120
In [258]:
plot(condrfTune$results)
Error in plot.train(condrfTune): There are no tuning parameters with more than 1 value.
Traceback:

1. plot(condrfTune)
2. plot(condrfTune)
3. plot.train(condrfTune)
4. stop(plotIt)
In [261]:
set.seed(100)
condrfTuneOOB <- train(x = solTrainXtrans, y = solTrainY,
                       method = "cforest",
                       tuneGrid = data.frame(mtry=c(80,100,120,140)),
                       controls = cforest_unbiased(ntree = 250),
                       trControl = trainControl(method = "oob"))
condrfTuneOOB
Conditional Inference Random Forest 

951 samples
228 predictors

No pre-processing
Resampling results across tuning parameters:

  mtry  RMSE       Rsquared 
   80   0.8124974  0.8498414
  100   0.8080433  0.8509387
  120   0.8096162  0.8508650
  140   0.8083359  0.8503763

RMSE was used to select the optimal model using  the smallest value.
The final value used for the model was mtry = 100.
In [272]:
plotsize(6,5)
plot(condrfTuneOOB$results[,c(3,1)], type='l')

Boosting

In [274]:
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 8.6 Boosting

gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 3),
                       n.trees = seq(100, 700, by = 300),
                       shrinkage = c(0.002, 0.02, 0.2),
                      n.minobsinnode = 2)
# The gbmGrid has 27 combinations, and we should get a sense about
#     the direction to take with a finer grid
set.seed(100)
gbmTune <- train(x = solTrainXtrans, y = solTrainY,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 trControl = ctrl4,
                 verbose = FALSE)
gbmTune
Stochastic Gradient Boosting 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  shrinkage  interaction.depth  n.trees  RMSE       Rsquared 
  0.002      1                  100      1.9129578  0.4564155
  0.002      1                  400      1.6503573  0.6028650
  0.002      1                  700      1.4648315  0.6811530
  0.002      4                  100      1.8135678  0.7399444
  0.002      4                  400      1.3597521  0.7732895
  0.002      4                  700      1.1183703  0.7991781
  0.002      7                  100      1.7860977  0.7843656
  0.002      7                  400      1.2855202  0.8115521
  0.002      7                  700      1.0282755  0.8342559
  0.020      1                  100      1.3262661  0.7190903
  0.020      1                  400      0.9061707  0.8228444
  0.020      1                  700      0.7983993  0.8551121
  0.020      4                  100      0.9806112  0.8209719
  0.020      4                  400      0.6896901  0.8887886
  0.020      4                  700      0.6466676  0.9016317
  0.020      7                  100      0.8836400  0.8513014
  0.020      7                  400      0.6355147  0.9055006
  0.020      7                  700      0.6123395  0.9121944
  0.200      1                  100      0.7796289  0.8578457
  0.200      1                  400      0.6900505  0.8879582
  0.200      1                  700      0.6862969  0.8889503
  0.200      4                  100      0.6684979  0.8947248
  0.200      4                  400      0.6622463  0.8964797
  0.200      4                  700      0.6655329  0.8956103
  0.200      7                  100      0.6473791  0.9018282
  0.200      7                  400      0.6452645  0.9029331
  0.200      7                  700      0.6511246  0.9012470

Tuning parameter 'n.minobsinnode' was held constant at a value of 2
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were n.trees = 700, interaction.depth =
 7, shrinkage = 0.02 and n.minobsinnode = 2.

Note that tuning these methods is getting complicated. I'm tempted to start using some design of experiments techniques such as a latin square. We can do this by eliminating a lot of the rows from gbmGrid.

In [275]:
plot(gbmTune, auto.key = list(columns = 4, lines = TRUE))
Warning message in draw.key(simpleKey(...), draw = FALSE):
“not enough rows for columns”
In [276]:
gbmGrid <- expand.grid(interaction.depth = seq(3, 9, by = 3),
                       n.trees = seq(300, 1000, by = 350),
                       shrinkage = c(0.004, 0.008, 0.016),
                      n.minobsinnode = 2)
# The gbmGrid has 27 combinations, and we should get a sense about
#     the direction to take with a finer grid
set.seed(100)
gbmTune <- train(x = solTrainXtrans, y = solTrainY,
                 method = "gbm",
                 tuneGrid = gbmGrid,
                 trControl = ctrl4,
                 verbose = FALSE)
gbmTune
Stochastic Gradient Boosting 

951 samples
228 predictors

No pre-processing
Resampling: Cross-Validated (10 fold) 
Summary of sample sizes: 714, 712, 713, 714 
Resampling results across tuning parameters:

  shrinkage  interaction.depth  n.trees  RMSE       Rsquared 
  0.004      3                   300     1.2417789  0.7645253
  0.004      3                   650     0.9501545  0.8185472
  0.004      3                  1000     0.8316373  0.8486517
  0.004      6                   300     1.1183628  0.8179533
  0.004      6                   650     0.8223158  0.8614993
  0.004      6                  1000     0.7229865  0.8825964
  0.004      9                   300     1.0626780  0.8415611
  0.004      9                   650     0.7694816  0.8771974
  0.004      9                  1000     0.6822111  0.8942697
  0.008      3                   300     0.9739097  0.8131254
  0.008      3                   650     0.7784491  0.8629806
  0.008      3                  1000     0.7144877  0.8814423
  0.008      6                   300     0.8496160  0.8556829
  0.008      6                   650     0.6864663  0.8911860
  0.008      6                  1000     0.6444264  0.9028693
  0.008      9                   300     0.7928510  0.8721158
  0.008      9                   650     0.6549492  0.9005658
  0.008      9                  1000     0.6277786  0.9078589
  0.016      3                   300     0.7914139  0.8591269
  0.016      3                   650     0.6865481  0.8897372
  0.016      3                  1000     0.6526446  0.9000259
  0.016      6                   300     0.6937021  0.8894741
  0.016      6                   650     0.6253546  0.9080669
  0.016      6                  1000     0.6084155  0.9129134
  0.016      9                   300     0.6658059  0.8974664
  0.016      9                   650     0.6172006  0.9106297
  0.016      9                  1000     0.6087321  0.9130890

Tuning parameter 'n.minobsinnode' was held constant at a value of 2
RMSE was used to select the optimal model using  the smallest value.
The final values used for the model were n.trees = 1000, interaction.depth =
 6, shrinkage = 0.016 and n.minobsinnode = 2.
In [277]:
plot(gbmTune, auto.key = list(columns = 4, lines = TRUE))
Warning message in draw.key(simpleKey(...), draw = FALSE):
“not enough rows for columns”
In [278]:
gbmImp <- varImp(gbmTune, scale = FALSE)
gbmImp
gbm variable importance

  only 20 most important variables shown (out of 228)

                  Overall
NumCarbon         17472.2
MolWeight         15003.3
SurfaceArea1       6785.8
SurfaceArea2       5183.3
HydrophilicFactor  4394.1
NumNonHAtoms       2093.1
NumAromaticBonds   1692.4
NumMultBonds       1291.9
NumChlorine        1100.5
NumHalogen          830.7
NumHydrogen         769.8
NumRotBonds         702.0
FP075               691.9
NumOxygen           600.2
FP072               537.8
NumAtoms            512.7
FP172               486.3
FP059               413.5
FP112               394.8
FP063               348.4
In [279]:
#save(list=ls(), file="data0623.Rdata")
load("data0623.Rdata")

Advanced rule-based models

In [ ]:
# (C) Kuhn and Johnson (2013)
################################################################################
### Section 8.7 Cubist

cbGrid <- expand.grid(committees = c(1:10, 20, 50, 75, 100), 
                      neighbors = c(0, 1, 5, 9))

set.seed(100)
cubistTune <- train(solTrainXtrans, solTrainY,
                    "cubist",
                    tuneGrid = cbGrid,
                    trControl = ctrl)
cubistTune
In [ ]:
plot(cubistTune, auto.key = list(columns = 4, lines = TRUE))
In [ ]:
cbImp <- varImp(cubistTune, scale = FALSE)
cbImp

Part 8. A regression example data set

In [ ]:
### R code from Applied Predictive Modeling (2013) by Kuhn and Johnson.
### Copyright 2013 Kuhn and Johnson
### Web Page: http://www.appliedpredictivemodeling.com
### Contact: Max Kuhn (mxkuhn@gmail.com)
###
### Chapter 10: Case Study: Compressive Strength of Concrete Mixtures
###
### Required packages: AppliedPredictiveModeling, caret, Cubist, doMC (optional),
###                    earth, elasticnet, gbm, ipred, lattice, nnet, party, pls,
###                    randomForests, rpart, RWeka      
###
### Data used: The concrete from the AppliedPredictiveModeling package
data(concrete)

featurePlot(concrete[, -9], concrete$CompressiveStrength,
            between = list(x = 1, y = 1),
            type = c("g", "p", "smooth"))
In [ ]:
### Copyright 2013 Kuhn and Johnson
################################################################################
### Section 10.1 Model Building Strategy
### There are replicated mixtures, so take the average per mixture
            
averaged <- ddply(mixtures,
                  .(Cement, BlastFurnaceSlag, FlyAsh, Water, 
                    Superplasticizer, CoarseAggregate, 
                    FineAggregate, Age),
                  function(x) c(CompressiveStrength = 
                    mean(x$CompressiveStrength)))

### Split the data and create a control object for train()

set.seed(975)
inTrain <- createDataPartition(averaged$CompressiveStrength, p = 3/4)[[1]]
training <- averaged[ inTrain,]
testing  <- averaged[-inTrain,]

ctrl <- trainControl(method = "repeatedcv", repeats = 5, number = 10)

### Create a model formula that can be used repeatedly

modForm <- paste("CompressiveStrength ~ (.)^2 + I(Cement^2) + I(BlastFurnaceSlag^2) +",
                 "I(FlyAsh^2)  + I(Water^2) + I(Superplasticizer^2)  +",
                 "I(CoarseAggregate^2) +  I(FineAggregate^2) + I(Age^2)")
modForm <- as.formula(modForm)

### Fit the various models

### Optional: parallel processing can be used via the 'do' packages,
### such as doMC, doMPI etc. We used doMC (not on Windows) to speed
### up the computations.

### WARNING: Be aware of how much memory is needed to parallel
### process. It can very quickly overwhelm the available hardware. The
### estimate of the median memory usage (VSIZE = total memory size) 
### was 2800M for a core although the M5 calculations require about 
### 3700M without parallel processing. 

### WARNING 2: The RWeka package does not work well with some forms of
### parallel processing, such as mutlicore (i.e. doMC). 

registerDoMC(14)

set.seed(669)
lmFit <- train(modForm, data = training,
               method = "lm",
               trControl = ctrl)
lmFit
In [ ]:
set.seed(669)
plsFit <- train(modForm, data = training,
                method = "pls",
                preProc = c("center", "scale"),
                tuneLength = 15,
                trControl = ctrl)

plsFit
In [ ]:
lassoGrid <- expand.grid(lambda = c(0, .001, .01, .1), 
                         fraction = seq(0.05, 1, length = 20))
set.seed(669)
lassoFit <- train(modForm, data = training,
                  method = "enet",
                  preProc = c("center", "scale"),
                  tuneGrid = lassoGrid,
                  trControl = ctrl)
lassoFit
In [ ]:
set.seed(669)
earthFit <- train(CompressiveStrength ~ ., data = training,
                  method = "earth",
                  tuneGrid = expand.grid(degree = 1, 
                                         nprune = 2:25),
                  trControl = ctrl)
earthFit
In [ ]:
set.seed(669)
svmRFit <- train(CompressiveStrength ~ ., data = training,
                 method = "svmRadial",
                 tuneLength = 15,
                 preProc = c("center", "scale"),
                 trControl = ctrl)

svmRFit
In [ ]:
nnetGrid <- expand.grid(decay = c(0.001, .01, .1), 
                        size = seq(1, 27, by = 2), 
                        bag = FALSE)
set.seed(669)
nnetFit <- train(CompressiveStrength ~ .,
                 data = training,
                 method = "avNNet",
                 tuneGrid = nnetGrid,
                 preProc = c("center", "scale"),
                 linout = TRUE,
                 trace = FALSE,
                 maxit = 1000,
                 allowParallel = FALSE,
                 trControl = ctrl)
nnetFit
In [ ]:
set.seed(669)
rpartFit <- train(CompressiveStrength ~ .,
                  data = training,
                  method = "rpart",
                  tuneLength = 30,
                  trControl = ctrl)
rpartFit
In [ ]:
set.seed(669)
treebagFit <- train(CompressiveStrength ~ .,
                    data = training,
                    method = "treebag",
                    trControl = ctrl)
treebagFit
In [ ]:
set.seed(669)
ctreeFit <- train(CompressiveStrength ~ .,
                  data = training,
                  method = "ctree",
                  tuneLength = 10,
                  trControl = ctrl)
ctreeFit
In [ ]:
set.seed(669)
rfFit <- train(CompressiveStrength ~ .,
               data = training,
               method = "rf",
               tuneLength = 10,
               ntrees = 1000,
               importance = TRUE,
               trControl = ctrl)

rfFit
In [ ]:
gbmGrid <- expand.grid(interaction.depth = seq(1, 7, by = 2),
                       n.trees = seq(100, 1000, by = 50),
                       shrinkage = c(0.01, 0.1))
set.seed(669)
gbmFit <- train(CompressiveStrength ~ .,
                data = training,
                method = "gbm",
                tuneGrid = gbmGrid,
                verbose = FALSE,
                trControl = ctrl)

gbmFit
In [ ]:
cbGrid <- expand.grid(committees = c(1, 5, 10, 50, 75, 100), 
                      neighbors = c(0, 1, 3, 5, 7, 9))
set.seed(669)
cbFit <- train(CompressiveStrength ~ .,
               data = training,
               method = "cubist",
               tuneGrid = cbGrid,
               trControl = ctrl)
cbFit
In [ ]:
### Turn off the parallel processing to use RWeka. 
registerDoSEQ()


set.seed(669)
mtFit <- train(CompressiveStrength ~ .,
               data = training,
               method = "M5",
               trControl = ctrl)
mtFit
In [ ]:
################################################################################
### Section 10.2 Model Performance

### Collect the resampling statistics across all the models

rs <- resamples(list("Linear Reg" = lmFit, "
                     PLS" = plsFit,
                     "Elastic Net" = lassoFit, 
                     MARS = earthFit,
                     SVM = svmRFit, 
                     "Neural Networks" = nnetFit,
                     CART = rpartFit, 
                     "Cond Inf Tree" = ctreeFit,
                     "Bagged Tree" = treebagFit,
                     "Boosted Tree" = gbmFit,
                     "Random Forest" = rfFit,
                     Cubist = cbFit))

#parallelPlot(rs)
#parallelPlot(rs, metric = "Rsquared")

### Get the test set results across several models

nnetPred <- predict(nnetFit, testing)
gbmPred <- predict(gbmFit, testing)
cbPred <- predict(cbFit, testing)

testResults <- rbind(postResample(nnetPred, testing$CompressiveStrength),
                     postResample(gbmPred, testing$CompressiveStrength),
                     postResample(cbPred, testing$CompressiveStrength))
testResults <- as.data.frame(testResults)
testResults$Model <- c("Neural Networks", "Boosted Tree", "Cubist")
testResults <- testResults[order(testResults$RMSE),]
testResults
In [ ]:
################################################################################
### Section 10.3 Optimizing Compressive Strength

### Create a function to maximize compressive strength* while keeping
### the predictor values as mixtures. Water (in x[7]) is used as the 
### 'slack variable'. 

### * We are actually minimizing the negative compressive strength

modelPrediction <- function(x, mod, limit = 2500)
{
  if(x[1] < 0 | x[1] > 1) return(10^38)
  if(x[2] < 0 | x[2] > 1) return(10^38)
  if(x[3] < 0 | x[3] > 1) return(10^38)
  if(x[4] < 0 | x[4] > 1) return(10^38)
  if(x[5] < 0 | x[5] > 1) return(10^38)
  if(x[6] < 0 | x[6] > 1) return(10^38)
  
  x <- c(x, 1 - sum(x))
  
  if(x[7] < 0.05) return(10^38)
  
  tmp <- as.data.frame(t(x))
  names(tmp) <- c('Cement','BlastFurnaceSlag','FlyAsh',
                  'Superplasticizer','CoarseAggregate',
                  'FineAggregate', 'Water')
  tmp$Age <- 28
  -predict(mod, tmp)
}

### Get mixtures at 28 days 
subTrain <- subset(training, Age == 28)

### Center and scale the data to use dissimilarity sampling
pp1 <- preProcess(subTrain[, -(8:9)], c("center", "scale"))
scaledTrain <- predict(pp1, subTrain[, 1:7])

### Randomly select a few mixtures as a starting pool

set.seed(91)
startMixture <- sample(1:nrow(subTrain), 1)
starters <- scaledTrain[startMixture, 1:7]
pool <- scaledTrain
index <- maxDissim(starters, pool, 14)
startPoints <- c(startMixture, index)

starters <- subTrain[startPoints,1:7]
startingValues <- starters[, -4]

### For each starting mixture, optimize the Cubist model using
### a simplex search routine

cbResults <- startingValues
cbResults$Water <- NA
cbResults$Prediction <- NA

for(i in 1:nrow(cbResults))
{
  results <- optim(unlist(cbResults[i,1:6]),
                   modelPrediction,
                   method = "Nelder-Mead",
                   control=list(maxit=5000),
                   mod = cbFit)
  cbResults$Prediction[i] <- -results$value
  cbResults[i,1:6] <- results$par
}
cbResults$Water <- 1 - apply(cbResults[,1:6], 1, sum)
cbResults <- subset(cbResults, Prediction > 0 & Water > .02)
cbResults <- cbResults[order(-cbResults$Prediction),][1:3,]
cbResults$Model <- "Cubist"
In [ ]:
### Do the same for the neural network model

nnetResults <- startingValues
nnetResults$Water <- NA
nnetResults$Prediction <- NA

for(i in 1:nrow(nnetResults))
{
  results <- optim(unlist(nnetResults[i, 1:6,]),
                   modelPrediction,
                   method = "Nelder-Mead",
                   control=list(maxit=5000),
                   mod = nnetFit)
  nnetResults$Prediction[i] <- -results$value
  nnetResults[i,1:6] <- results$par
}
nnetResults$Water <- 1 - apply(nnetResults[,1:6], 1, sum)
nnetResults <- subset(nnetResults, Prediction > 0 & Water > .02)
nnetResults <- nnetResults[order(-nnetResults$Prediction),][1:3,]
nnetResults$Model <- "NNet"
In [ ]:
### Convert the predicted mixtures to PCA space and plot

pp2 <- preProcess(subTrain[, 1:7], "pca")
pca1 <- predict(pp2, subTrain[, 1:7])
pca1$Data <- "Training Set"
pca1$Data[startPoints] <- "Starting Values"
pca3 <- predict(pp2, cbResults[, names(subTrain[, 1:7])])
pca3$Data <- "Cubist"
pca4 <- predict(pp2, nnetResults[, names(subTrain[, 1:7])])
pca4$Data <- "Neural Network"

pcaData <- rbind(pca1, pca3, pca4)
pcaData$Data <- factor(pcaData$Data,
                       levels = c("Training Set","Starting Values",
                                  "Cubist","Neural Network"))

lim <- extendrange(pcaData[, 1:2])

xyplot(PC2 ~ PC1, 
       data = pcaData, 
       groups = Data,
       auto.key = list(columns = 2),
       xlim = lim, 
       ylim = lim,
       type = c("g", "p"))